Abstract


We consider the problem of learning to choose actions using contextual information when provided 
with limited feedback in the form of relative pairwise comparisons. We study this problem in the 
dueling-bandits framework of Yue et al. (2009), which we extend to incorporate context. Roughly, 
the learner’s goal is to find the best policy, or way of behaving, in some space of policies, although 
“best” is not always so clearly defined. Here, we propose a new and natural solution concept, rooted 
in game theory, called a von Neumann winner, a randomized policy that beats or ties every other 
policy. We show that this notion overcomes important limitations of existing solutions, particularly 
the Condorcet winner which has typically been used in the past, but which requires strong and 
often unrealistic assumptions. We then present three efficient algorithms for online learning in 
our setting, and for approximating a von Neumann winner from batch-like data. The first of these 
algorithms achieves particularly low regret, even when data is adversarial, although its time and 
space requirements are linear in the size of the policy space. The other two algorithms require time 
and space only logarithmic in the size of the policy space when provided access to an oracle for 
solving classification problems on the space. 

Keywords: contextual dueling bandits, online learning, bandit algorithms, game theory. 

1. Introduction 

We study how to learn to act based on contextual information when provided only with partial, 
relative feedback. This problem naturally arises in information retrieval (IR) and recommender 
systems, where the user feedback is considerably more reliable when interpreted as relative
comparisons rather than absolute labels (Radlinski et al., 2008). For instance, in web search, for a
particular query, the IR system may have several candidate rankings of documents that could be 
presented, with the best option being dependent upon the specific user. By presenting a mix or 
interleaving of two of the candidate rankings and observing the user’s response (Chapelle et al., 
2012; Hofmann et al., 2013), it is possible for such a system to get feedback about user preferences.
However, this feedback is partial since it is only with respect to the two rankings that were chosen,
and it is relative since it only tells which of the two rankings is preferred to the other.

* On leave from Princeton University.

† Part of this research was conducted during an internship with Microsoft Research.

The dueling-bandits problem of Yue et al. (2009) formalizes this setting. Abstractly, the learner 
is repeatedly faced with a set of possible actions, and may select two of these actions to face off in 
a duel whose stochastically determined winner is then revealed. Through such experimentation, the 
learner attempts to find the “best” of the actions. 

In this paper, we focus on the contextual dueling bandit setting, where context can provide 
information that helps identify the best action. For instance, in the example above, the actions may 
be the candidate rankings to choose among, and the context may be additional information about 
the user or query that might help in choosing the best ranking. The learner’s goal now is to find a 
good policy, a rule for choosing actions based on context. 

Similar to prior work on contextual (non-dueling) bandits (Auer et al., 2002; Langford and
Zhang, 2007; Dudík et al., 2011; Agarwal et al., 2014), we propose a setting in which the learner
has access to a space of policies $\Pi$, with the goal of performing as well as the “best” in the space.
This space plays a role analogous to the hypothesis space in supervised learning. It will typically 
be extremely large or even infinite. We therefore explicitly aim for methods that will be applicable 
when this is the case. 

Merely defining the precise goal of learning can be problematic in such a relative-feedback 
setting. When rewards are absolute, the best policy in $\Pi$ is clearly and easily defined as the one
that achieves the highest expected reward, because, by such an absolute measure, this policy beats
every other policy. In a relative-feedback setting, since we have a means of obtaining pairwise
comparisons between actions or policies, we might aim to choose the policy in $\Pi$ that (on average)
beats every other policy in the class in such head-to-head competitions. Most previous work on
dueling bandits (Yue et al., 2009; Yue and Joachims, 2011; Urvoy et al., 2013; Zoghi et al., 2014b)
has in fact explicitly or implicitly assumed that such a Condorcet winner exists. But there are good
reasons to doubt such a strong assumption, particularly when working with large and rich policy 
spaces. There are numerous examples, even in natural situations, where this assumption (and more 
generally, transitivity among policies) is known to fail (see, for instance, Gardner, 1970; Zoghi et al., 
2014a). Indeed, the preferences of a population of users do not need to be transitive, even if each 
individual user has transitive preferences. 

In this paper, we seek to improve the dueling bandits techniques in two respects. First, we 
seek to relax the modeling restrictions on which previous methods have depended so as to develop 
methods that are more generally applicable. Second, we seek to achieve a similar level of flexibility 
in the design of policies as for supervised learning algorithms. 

Contributions. Our first contribution (in Section 2) is the introduction of a new solution concept, 
called the von Neumann winner, which is based on a game-theoretic interpretation. Like a Condorcet 
winner, when facing any other policy in a duel, a von Neumann winner has at least a 50% chance 
of winning; in this sense, a von Neumann winner is at least as good as every policy in the space. 
On the other hand, a von Neumann winner is always guaranteed to exist, without any extraneous 
assumptions. This guarantee is made possible by allowing policies to be selected in a randomized 
fashion, as is quite natural in such a learning setting. 

With the goal of learning clarified, we turn to algorithms. As a warm-up, in Section 5, we give 
a fully online algorithm in which two copies of the Exp4.P multi-armed bandit algorithm
(Beygelzimer et al., 2011) are run against one another (using a “sparring” approach previously suggested
by Ailon et al., 2014). Although yielding good regret, this algorithm requires time and space linear
in $|\Pi|$, which is impractical in most realistic settings where we would expect $\Pi$ to be enormous.

To address this difficulty, we propose an approach used previously in other works on contextual 
bandits (Langford and Zhang, 2007; Dudík et al., 2011; Agarwal et al., 2014). Specifically, we
assume that we have access to a classification oracle for our policy class that can find the
minimum-cost policy in $\Pi$ when given the cost of each action on each of a sequence of contexts. In fact, an
ordinary cost-sensitive, multiclass classification learning algorithm can be used for this purpose, 
which suggests that, practically, this may be a reasonable and natural assumption. 

We then consider techniques for constructing a von Neumann winner from empirical exploration 
data. (Although we focus on a batch-like setting, the resulting algorithms can be used online as 
well.) We analyze the statistical efficiency of this approach in Section 6. In Sections 7 and 8, we 
give two polynomial-time algorithms for computing an approximate von Neumann winner from 
data: one based on Kalai and Vempala’s Follow-the-Perturbed-Leader algorithm (2003), and the 
other based on projected gradient ascent as studied by Zinkevich (2003). These techniques yield 
learning algorithms that approximate or perform as well as the von Neumann winner, using data, 
time, and space that only depend logarithmically on the cardinality of the space $\Pi$, and therefore,
are applicable even with huge policy spaces. 

Other related work. Numerous algorithms have been proposed for the (non-contextual) dueling 
bandits problem: Interleaved Filter (Yue et al., 2009); Beat the Mean (BTM) (Yue and Joachims, 
2011); Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE) (Urvoy et al., 2013);
Relative Confidence Sampling (Zoghi et al., 2014a); Relative Upper Confidence Bound (RUCB) 
(Zoghi et al., 2014b); Doubler, MultiSBM and Sparring (Ailon et al., 2014) and mergeRUCB (Zoghi 
et al., 2015b). These methods impose various constraints on the problem at hand, ranging from 
the requirement that it arise from an underlying utility function (e.g., MultiSBM) to no constraint 
at all (e.g., SAVAGE); they mainly provide regret bounds that are logarithmic in the number of 
rounds, and at least linear in the number of actions. In principle, these methods could be applied to 
contextual dueling bandits by treating policies as actions. But this would lead to regret at least linear 
in the number of policies, which is far worse than the bounds obtained in this paper, which are logarithmic in the number of policies.

The method that is the most closely related to our work is Dueling Bandit Gradient Descent 
(DBGD) (Yue and Joachims, 2009), a policy gradient method that iteratively improves upon the 
current policy by conducting comparisons with nearby policies, assuming that the policy space 
comes equipped with a distance metric, and incrementally adapting the policy if a better alternative 
is encountered. As with all local optimization methods, DBGD imposes a convexity assumption on 
the dueling bandit problem for its performance guarantee: the dueling bandit problem is assumed 
to arise from the noisy observations of an underlying convex objective function. In this paper, we 
both relax the assumptions imposed by DBGD and improve upon the regret bound. 

2. Dueling bandits and the von Neumann winner 

In the dueling bandits problem (Yue et al., 2009), the learner has access to K possible actions, 
1, ..., K, and attempts to determine the “best” action through repeated stochastic pairwise
comparisons of actions, called duels. Thus, at each time step, the learner chooses a pair of actions (a, b) for
a duel; the outcome of the duel is +1 if a wins, and -1 if b wins. The (unknown) expected value of
this outcome is denoted P(a, b), and is assumed to depend only on the selected pair (a, b). In other
words, the probability that a beats b in a duel is (P(a, b) + 1)/2, and the two actions are exactly




Dudik Hofmann Schapire Slivkins Zoghi 


evenly matched if P(a, b) = 0. We say that a beats b to mean that the chance of a winning a duel 
with b is strictly greater than 1/2; similarly, a ties b if this probability is exactly 1/2. 

The K × K matrix P of all such expectations P(a, b) is called the preference matrix.¹ This
matrix is initially unknown to the learner, but can be discovered bit-by-bit through experimentation.
We assume, of course, that all of the entries of P are in [-1, +1], and furthermore, that P is
skew-symmetric, meaning that $P^\top = -P$ so that a duel (b, a) is equivalent to (the negation of) a
duel (a, b). (This also implies P(a, a) = 0 for every action a, as is natural.) Other than this, we
strenuously avoid making any assumptions in the current work about the matrix P. For instance, 
we do not make any assumptions regarding transitivity among the various actions. 

In such a relative-feedback setting, the “best” action is not always well-defined because there 
is no measure of the absolute quality of actions. Existing work typically assumes the existence of 
a Condorcet winner (Urvoy et al., 2013; Zoghi et al., 2014b), that is, an action a* that beats every
other action a ≠ a*. This is a very natural definition from a preference learning perspective, since a*
is indeed preferred to every other action. However, it has been shown that dueling bandit problems
without Condorcet winners arise regularly in practice (Zoghi et al., 2014a).²

Although there is no guarantee of a single action beating all others, the situation changes
considerably if we simply allow actions to be selected in a randomized fashion. With this natural
relaxation, the problem of non-existence entirely vanishes. Thus, the idea is to find a probability
vector w in $\Delta_K$ (where $\Delta_K$ is the simplex of vectors in $[0,1]^K$ whose entries sum to 1) such that

$$\sum_{a=1}^{K} w(a)\, P(a, b) \ge 0 \quad \text{for all actions } b. \tag{1}$$

In words, for every action b, if a is selected randomly according to distribution w, then the chance 
of beating b in a duel is at least 1/2. (Note that this property implies that the same will be true if 
b is itself selected in a randomized way.) A distribution w with this property is said to be a von 
Neumann winner for the preference matrix P. 
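
As an illustration, the following minimal Python sketch checks the defining condition of Eq. (1) on a small cyclic preference matrix. The matrix `P_cycle`, the helper names, and the numerical tolerance are assumptions made for this example only.

```python
def is_von_neumann_winner(w, P, tol=1e-12):
    """Return True if sum_a w[a] * P[a][b] >= 0 for every action b (Eq. 1)."""
    K = len(P)
    return all(sum(w[a] * P[a][b] for a in range(K)) >= -tol for b in range(K))


def condorcet_winner(P):
    """Return the index of an action that beats every other action, or None."""
    K = len(P)
    for a in range(K):
        if all(P[a][b] > 0 for b in range(K) if b != a):
            return a
    return None


# Hypothetical cyclic preferences: action 0 beats 1, 1 beats 2, and 2 beats 0.
P_cycle = [[0.0, 0.5, -0.5],
           [-0.5, 0.0, 0.5],
           [0.5, -0.5, 0.0]]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(condorcet_winner(P_cycle))                 # no single action beats all others
print(is_von_neumann_winner(uniform, P_cycle))   # the uniform mixture satisfies Eq. (1)
```

In this cyclic example no deterministic choice satisfies the condition, whereas the uniform distribution does, which is the situation the randomized relaxation is meant to handle.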

As the name reflects, this notion is intimately connected to a game-theoretic interpretation. 
Indeed, we can view preference matrix P as describing a zero-sum matrix game. In such a game, 
the two players simultaneously choose distributions (or mixed strategies) w and u over rows and
columns, respectively, yielding a gain to the row player of $w^\top P u$. According to von Neumann’s
celebrated minmax theorem, for any matrix P,

$$\max_{w \in \Delta_K} \min_{u \in \Delta_K} w^\top P u = \min_{u \in \Delta_K} \max_{w \in \Delta_K} w^\top P u,$$

the common value being the value of the game P. A maxmin strategy w or a minmax strategy u is 
one realizing the max or min on the left- or right-hand side of this equality, respectively. Finding 
these strategies is called solving the game. 

We have assumed that the matrix P is itself skew-symmetric, so the game it describes is a 
symmetric game. Such games are known to have value exactly equal to zero (see, for instance, 
Owen, 1995, Theorem II.6.2). Working through definitions, this means that w is a maxmin strategy 
if and only if $\min_{u \in \Delta_K} w^\top P u \ge 0$. But this is exactly equivalent to Eq. (1). Therefore:

Proposition 1 A probability vector $w \in \Delta_K$ is a von Neumann winner for preference matrix P if
and only if it is a maxmin strategy for the game P. Consequently, every preference matrix P has a 
von Neumann winner. 
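
For intuition, the fact that a symmetric game has value zero can also be checked directly. The following short derivation is a sketch that uses only skew-symmetry of P and the minmax theorem; it is not taken from the cited reference.

```latex
\begin{align*}
w^\top P w &= (w^\top P w)^\top = w^\top P^\top w = -\,w^\top P w
  \quad\Longrightarrow\quad w^\top P w = 0 \ \text{ for every } w \in \Delta_K, \\
\max_{w \in \Delta_K} \min_{u \in \Delta_K} w^\top P u
  &\le \max_{w \in \Delta_K} w^\top P w = 0, \\
\min_{u \in \Delta_K} \max_{w \in \Delta_K} w^\top P u
  &\ge \min_{u \in \Delta_K} u^\top P u = 0.
\end{align*}
```

Since the minmax theorem states that the two left-hand sides coincide, their common value must be exactly zero.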

1. In the literature, the preference matrix P often refers to the matrix of probabilities (P(a, b) + 1)/2. With our
modification, P(a, b) becomes anti-symmetric around 0, which simplifies the arguments considerably.

2. See also Appendix A for more compelling evidence that this is indeed the case. 

Before continuing, we briefly mention some of the other solution concepts that have been
proposed to remedy the potential non-existence of a Condorcet winner (Schulze, 2011). Two of these
are the Borda winner, the action that has the highest probability of winning a duel against a
uniformly random action; and the Copeland winner, the action that wins the most pairwise
comparisons (Urvoy et al., 2013). Both of these fail the independence of clones criterion (Schulze, 2011),
meaning that adding multiple identical copies of an action can change the Borda or Copeland
winner. This criterion is particularly crucial in a dueling bandit setting because a given policy class may
contain many identical policies. In contrast, the von Neumann winner performs at least as well as 
any individual policy, and is thus unaffected by the presence or absence of clones. See Appendix B 
for a more detailed discussion. Note that if there does happen to exist a Condorcet winner, then it 
will also be the unique von Neumann winner. 
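
The sensitivity of the Borda winner to clones can be seen on a small example. The sketch below uses a hypothetical three-action preference matrix in which action 0 is a Condorcet winner; the helper names and the numbers are assumptions chosen for illustration.

```python
def borda_winner(P):
    """Action with the highest total expected outcome against all actions."""
    K = len(P)
    scores = [sum(P[a][b] for b in range(K)) for a in range(K)]
    return max(range(K), key=lambda a: scores[a])


def add_clone(P, c):
    """Return a new preference matrix with an extra copy of action c appended."""
    K = len(P)
    Q = [row[:] + [row[c]] for row in P]            # every action faces the clone as it faces c
    Q.append([P[c][b] for b in range(K)] + [0.0])   # the clone behaves like c and ties itself
    return Q


# Action 0 beats both other actions; action 1 beats action 2 by a wide margin.
P = [[0.0, 0.5, 0.1],
     [-0.5, 0.0, 0.8],
     [-0.1, -0.8, 0.0]]
P_cloned = add_clone(P, 2)

print(borda_winner(P), borda_winner(P_cloned))
# Playing action 0 with probability one still satisfies Eq. (1) after cloning:
print(all(P_cloned[0][b] >= 0 for b in range(len(P_cloned))))
```

Here duplicating the weakest action shifts the Borda winner away from action 0, while the point mass on action 0 remains a von Neumann winner in both matrices.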

3. Incorporating context 

Next, we consider how the preceding development can be extended to a much more realistic setting 
in which the best way of acting may depend on additional, observable information, called context. 
Thus, prior to choosing actions, the learner is allowed to observe some value x, the context, selected 
by Nature from some unspecified space X. For instance, x might be a feature-vector description 
of a web user. In this setting, the preference matrix is no longer static; rather, which actions are 
better than which others now varies and depends on the context, which therefore must be taken into 
account to fully optimize the choice of actions. 

Formally, we assume that on every round t of the learning process, a context $x_t$ and preference
matrix $P_t$ are chosen by Nature. The context $x_t$ is revealed to the learner, but the preference matrix
$P_t$ remains hidden. Based on $x_t$, the learner selects two actions $(a_t, b_t)$ for a duel, whose outcome
has expectation determined by the current (hidden) preference matrix $P_t$ in the usual way. Except
where noted otherwise, in this paper, we always assume that each pair $(x_t, P_t)$ is chosen at random
according to independent draws from some unknown joint distribution $\mathcal{D}$.

The goal is to determine which action to select as a function of the context. Such a mapping 
$\pi$ from contexts x to actions a is called a policy. Typically, we are working with policies of a
particular form, that is, from some policy space $\Pi$. For instance, this space might represent the set
of all decision trees. For simplicity, we assume that $\Pi$ has finite cardinality. However, we generally
think of $\Pi$ as an extremely large space, exponential in any reasonable measure of complexity.

The notion of von Neumann winner (as well as other concepts, like Condorcet winner) can 
be extended to incorporate context essentially by reducing to the non-contextual setting. We can 
regard each policy $\pi$ as a “meta-action,” and define a $|\Pi| \times |\Pi|$ preference matrix M over these
meta-actions. Thus, the rows and columns of M are each indexed by policies in $\Pi$, and

$$M(\pi, \rho) = \mathbb{E}_{(x, P) \sim \mathcal{D}} \bigl[ P(\pi(x), \rho(x)) \bigr]. \tag{2}$$

This quantity is the expected outcome when a “meta-duel” is held between the two policies $\pi$ and
$\rho$, whose stochastic outcome is determined by randomly selecting $(x, P) \sim \mathcal{D}$, and then holding
an ordinary duel on P between the actions $\pi(x)$ and $\rho(x)$. This huge matrix M thus encodes the
probability of any policy beating any other policy in a duel. 
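
To make the construction of M concrete, the sketch below evaluates Eq. (2) for a toy distribution with two equally likely contexts and three policies. The contexts, matrices, and policy definitions are hypothetical and serve only to illustrate the definition.

```python
def meta_preference_matrix(policies, support):
    """M[p][q] = sum over (prob, x, P) of prob * P[policies[p](x)][policies[q](x)]."""
    n = len(policies)
    M = [[0.0] * n for _ in range(n)]
    for prob, x, P in support:
        for p, pi in enumerate(policies):
            for q, rho in enumerate(policies):
                M[p][q] += prob * P[pi(x)][rho(x)]
    return M


# Two equally likely user types with opposite preferences over K = 2 actions.
P_likes_0 = [[0.0, 0.6], [-0.6, 0.0]]
P_likes_1 = [[0.0, -0.6], [0.6, 0.0]]
support = [(0.5, "type_a", P_likes_0), (0.5, "type_b", P_likes_1)]

policies = [
    lambda x: 0,                           # always play action 0
    lambda x: 1,                           # always play action 1
    lambda x: 0 if x == "type_a" else 1,   # adapt the action to the context
]
M = meta_preference_matrix(policies, support)
for row in M:
    print(row)
```

In this toy setting the two constant policies tie each other on average, while the context-dependent policy has a positive entry against both, so it is a Condorcet winner of M.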

We can now define von Neumann winner in the contextual setting to be an (ordinary) von
Neumann winner for the matrix M (regarded here as a kind of “meta-preference-matrix”). Thus,
unraveling definitions, a (contextual) von Neumann winner is a probability distribution W over policies
such that for every opposing policy $\rho$, if $\pi$ is chosen at random from W, then the probability that
$\pi$ beats $\rho$ in a duel is at least 1/2. That is, the randomized policy defined by W beats or ties every
policy in the space $\Pi$. By Proposition 1 (applied to M), such a von Neumann winner must exist.

For the rest of the paper, we study how to compute (or approximate) contextual von Neumann 
winners. Of course, because the space $\Pi$ and corresponding matrix M are both gigantic, this will
present significant computational challenges. 

4. Learning scenarios 

We consider two possible learning scenarios. 

In the simpler of these, called explore-then-exploit, we suppose that the learner is allowed to 
explore for some number of rounds m (where, as described above, on each round, the learner is 
presented with a random context and permitted to run and observe the outcome of a duel between 
a pair of actions of its choosing). At the end of these m rounds, the learner outputs a distribution 
W over policies in $\Pi$. The learner’s goal is to produce W which is an $\epsilon$-approximate von Neumann
winner, that is, for which

$$\min_{U \in \Delta_{|\Pi|}} W^\top M U \ge -\epsilon$$

for some small $\epsilon > 0$. In other words, for all $\pi \in \Pi$, W should beat $\pi$ with probability at least
$1/2 - \epsilon/2$. Naturally, m should be “reasonable” as a function of $\epsilon$. This setting is almost like learning
from a passively selected batch of training examples, except that the learner has an active role in 
selecting which actions to play in each duel. 

In the alternative full-explore-exploit setting, learning occurs in a fully online manner across T 
rounds (in the manner described earlier), with performance measured using some notion of regret. 
In this paper, where we are working with policies and changing preference matrices, we propose to 
define regret to be 

$$\max_{\pi \in \Pi} \frac{1}{2} \sum_{t=1}^{T} \bigl[ P_t(\pi(x_t), a_t) + P_t(\pi(x_t), b_t) \bigr]. \tag{3}$$

If we can find an algorithm for which this regret is o(T), then eventually the algorithm selects
actions $(a_t, b_t)$ which cannot be beaten by any other policy $\pi \in \Pi$.
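
In simulation, where the hidden preference matrices are available, the regret of Eq. (3) can be evaluated directly. The sketch below assumes the regret is the maximum over policies of one half of the summed expected outcomes of that policy against both played actions; the log format and the toy data are illustrative assumptions.

```python
def dueling_regret(policies, rounds):
    """Max over policies of (1/2) * sum_t [P_t(pi(x_t), a_t) + P_t(pi(x_t), b_t)].

    `rounds` is a list of tuples (x_t, P_t, a_t, b_t). The matrices P_t are
    hidden from the learner, so this is only computable in simulation.
    """
    return max(
        0.5 * sum(P[pi(x)][a] + P[pi(x)][b] for x, P, a, b in rounds)
        for pi in policies
    )


# Static toy problem: action 0 beats action 1 with probability 0.8.
P = [[0.0, 0.6], [-0.6, 0.0]]
policies = [lambda x: 0, lambda x: 1]

bad_log = [(None, P, 1, 1)] * 3    # the learner always duels action 1 against itself
good_log = [(None, P, 0, 0)] * 3   # the learner always duels action 0 against itself

print(dueling_regret(policies, bad_log))
print(dueling_regret(policies, good_log))
```

Playing the dominated action accumulates regret linearly in the number of rounds, while playing the Condorcet winner against itself accumulates none.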

In the standard dueling-bandits setting with a static preference matrix, a seemingly different
definition of regret was used by Yue et al. (2009) in terms of an assumed Condorcet winner. However,
when specialized to their setting, and when provided with their same assumptions, their definition 
can be shown to be equivalent (up to constant factors) to Eq. (3). 

5. Sparring Exp4.P 

Our goal then is to find, approximate or perform as well as a von Neumann winner, which, as 
we have seen, is a maxmin strategy for a particular game. Under this interpretation, it becomes 
especially natural to use ordinary no-regret learning algorithms as players of this game since it is 
known that such algorithms, when properly configured for this purpose, will converge to maxmin or 
minmax strategies (Freund and Schapire, 1999). The idea is simply to run two independent copies 
of such an algorithm against one another. Such a “sparring” approach was previously proposed for 
dueling bandits by Ailon et al. (2014), though without details, and not in the contextual setting. 
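
The convergence of no-regret self-play to a maxmin strategy can be illustrated in the simplest full-information case. The sketch below runs two copies of Hedge (multiplicative weights) against each other on a known skew-symmetric matrix; it is a simplified stand-in and not Exp4.P, and the step size, horizon, and example matrix are assumptions.

```python
import math


def worst_case_margin(w, P):
    """min over pure opponents b of sum_a w[a] * P[a][b]."""
    K = len(P)
    return min(sum(w[a] * P[a][b] for a in range(K)) for b in range(K))


def hedge_self_play(P, T):
    """Run two Hedge learners against each other; return the row player's average strategy."""
    K = len(P)
    eta = math.sqrt(2 * math.log(K) / T)
    row_gain = [0.0] * K   # cumulative gain of each pure row action
    col_gain = [0.0] * K   # cumulative gain of each pure column action
    avg = [0.0] * K

    def weights(gains):
        top = max(gains)   # subtract the maximum for numerical stability
        z = [math.exp(eta * (g - top)) for g in gains]
        s = sum(z)
        return [v / s for v in z]

    for _ in range(T):
        w, u = weights(row_gain), weights(col_gain)
        for a in range(K):
            avg[a] += w[a] / T
            # The row player gains w^T P u; the column player gains its negation.
            row_gain[a] += sum(P[a][b] * u[b] for b in range(K))
            col_gain[a] -= sum(w[c] * P[c][a] for c in range(K))
    return avg


P = [[0.0, 0.2, -0.6],
     [-0.2, 0.0, 0.4],
     [0.6, -0.4, 0.0]]
w_bar = hedge_self_play(P, 5000)
print(w_bar, worst_case_margin(w_bar, P))
```

The averaged strategy is guaranteed a worst-case margin within the two players' average regret of the game value, which is zero here, so the printed margin should be close to zero from below.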

We consider using the multi-armed bandit algorithm Exp4.P (Beygelzimer et al., 2011) for this
purpose in the full-explore-exploit setting. Exp4.P is well-suited since it is designed to work with
partial information as in our bandit setting, and since it can handle the kind of adversarially
generated data that arises unavoidably when playing a game. It also is designed to work with policies in
a contextual setting like ours (or, more generally, to accept the advice of “experts”). 

The learning setting for Exp4.P is as follows (somewhat, but straightforwardly, modified for 
our present purposes). There are K possible actions, 1, ..., K, and a finite space $\Pi$ of policies.
On each round t = 1, ..., T, an adversary chooses and reveals a context $x_t$, and also chooses,
but does not reveal, rewards $r_t(1), \ldots, r_t(K) \in [-1, +1]$ for each of the K actions. The learner
then selects an action $a_t$, and receives the revealed reward $r_t(a_t)$. The learner’s total reward is thus
$G_A = \sum_{t=1}^{T} r_t(a_t)$, while the reward of each policy $\pi$ is $G_\pi = \sum_{t=1}^{T} r_t(\pi(x_t))$. The learner’s goal
is to receive reward close to that of the best policy. Beygelzimer et al. (2011) prove that (subject to
very benign conditions) with probability at least $1 - \delta$, Exp4.P achieves reward

$$G_A \ge \max_{\pi \in \Pi} G_\pi - 12 \sqrt{K T \ln(|\Pi| / \delta)}. \tag{4}$$

(This holds for any $\delta > 0$; the $\delta$ is passed as a parameter to the algorithm.)

For contextual dueling bandits, we run two separate copies of Exp4.P which are played against 
one another; let us call them row-Exp and column-Exp. We use the same actions, contexts, and 
policies for the two copies as for the original problem. On each round t, Nature chooses a context
$x_t$ and a preference matrix $P_t$. The context (but not the preference matrix) is revealed to row-Exp
and column-Exp, which select actions $a_t$ and $b_t$, respectively. A duel is then held between these two
actions; the outcome r is passed as feedback to row-Exp (for its chosen action $a_t$), and its negation
-r is similarly passed to column-Exp. We call this algorithm SparringEXP4.P.
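
The sparring protocol can be sketched as follows. To keep the example short, each player is a simplified Exp4-style learner (importance-weighted exponential weights over policies with uniform exploration) rather than Exp4.P itself, whose confidence terms are omitted; the class name `Exp4Lite`, the parameters `gamma` and `eta`, and the `draw_round` callback are all hypothetical.

```python
import math
import random


class Exp4Lite:
    """Simplified Exp4-style learner over a finite list of policies (not Exp4.P)."""

    def __init__(self, policies, K, gamma=0.1, eta=0.05):
        self.policies, self.K, self.gamma, self.eta = policies, K, gamma, eta
        self.log_w = [0.0] * len(policies)

    def policy_distribution(self):
        top = max(self.log_w)
        z = [math.exp(v - top) for v in self.log_w]
        s = sum(z)
        return [v / s for v in z]

    def act(self, x, rng):
        q = self.policy_distribution()
        self.votes = [pi(x) for pi in self.policies]
        p = [self.gamma / self.K] * self.K          # uniform exploration floor
        for qi, a in zip(q, self.votes):
            p[a] += (1 - self.gamma) * qi
        self.p = p
        self.action = rng.choices(range(self.K), weights=p)[0]
        return self.action

    def update(self, reward):
        # Importance-weighted reward estimate, rescaled from [-1, 1] to [0, 1].
        est = ((reward + 1) / 2) / self.p[self.action]
        for i, a in enumerate(self.votes):
            if a == self.action:
                self.log_w[i] += self.eta * est


def spar(policies, K, draw_round, T, seed=0):
    """Play row and column learners against each other for T rounds."""
    rng = random.Random(seed)
    row, col = Exp4Lite(policies, K), Exp4Lite(policies, K)
    for _ in range(T):
        x, P = draw_round(rng)                      # Nature picks (x_t, P_t)
        a, b = row.act(x, rng), col.act(x, rng)     # only x_t is revealed
        r = 1 if rng.random() < (P[a][b] + 1) / 2 else -1
        row.update(r)                               # row player receives r
        col.update(-r)                              # column player receives -r
    return row, col
```

A caller would supply `draw_round`, a function that takes the random generator and returns a context together with its hidden preference matrix, and would then read off `row.policy_distribution()` as the row player's current mixture over policies.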

Theorem 2 Consider K actions, policy space $\Pi$, and time horizon T. Fix parameter $\delta > 0$. Then
with probability at least $1 - \delta$, SparringEXP4.P achieves regret at most $O(\sqrt{K T \ln(|\Pi| / \delta)})$.

The proof is in Appendix C. This result holds also for an adversarial environment in which 
the pairs $(x_t, P_t)$ are selected by an adversary rather than at random. Also, we can adapt this
algorithm for explore-then-exploit learning using the following standard technique for
online-to-batch conversion. Run SparringEXP4.P for m exploration rounds. In each round i, row-Exp
internally computes a distribution $w_i$ over policies. Then $w = (1/m) \sum_{i=1}^{m} w_i$ is an $\epsilon$-approximate
von Neumann winner where $\epsilon = O(\sqrt{K \ln(|\Pi| / \delta) / m})$.

Although yielding very good regret bounds and handling adversaries, this approach requires 
time and space proportional to $|\Pi|$, and is therefore not practical for extremely large policy spaces.

6. Explore-then-exploit algorithms with a classification oracle 

We next begin a development that will lead to efficient methods (in terms of time, space and data) 
for handling even extremely large policy spaces, under a particular assumption discussed below. We 
describe a general approach for exploration, for using the collected data to find a statistically sound 
solution, and for reducing the problem that must be solved to a more tractable form. 

We focus mainly on the explore-then-exploit problem. Thus, we have m exploration rounds, 
and on each round i, a pair $(x_i, P_i)$ is selected at random, and the learner is permitted to choose and
observe the outcome of a single duel $(a_i, b_i)$. Although $x_i$ is observed, $P_i$ is not. Here, we propose
a simple exploration strategy, called uniform exploration, in which each dueling pair $(a_i, b_i)$ is
selected uniformly at random. Let $r_i$ be the resulting observed outcome. Based on these, the learner
can obtain a noisy but unbiased version of the hidden preference matrix $P_i$. Specifically, let us
define a matrix $\hat{P}_i$, where $\hat{P}_i(a_i, b_i) = K^2 r_i$, and all other entries $\hat{P}_i(a, b)$ are set to zero. It can be
verified that the expected value of each entry $\hat{P}_i(a, b)$ is exactly $P_i(a, b)$; that is, $\mathbb{E}[\hat{P}_i \mid P_i] = P_i$.
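
The unbiasedness claim can be verified exactly on a small example by enumerating every pair and outcome. The sketch below assumes the estimate places $K^2 r$ in the entry of the chosen pair, where the pair is uniform over all $K^2$ ordered pairs; the example matrix is hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def expected_estimate(P):
    """Exact expectation of the uniform-exploration estimate of P.

    Each ordered pair (a, b) is drawn with probability 1/K^2; the duel outcome r
    is +1 with probability (P[a][b] + 1)/2 and -1 otherwise; the estimate puts
    K^2 * r in entry (a, b) and zero in every other entry.
    """
    K = len(P)
    E = [[Fraction(0)] * K for _ in range(K)]
    pair_prob = Fraction(1, K * K)
    for a in range(K):
        for b in range(K):
            p_win = (P[a][b] + 1) / 2
            for r, prob_r in ((1, p_win), (-1, 1 - p_win)):
                E[a][b] += pair_prob * prob_r * (K * K * r)
    return E


P = [[Fraction(0), Fraction(1, 2), Fraction(-3, 10)],
     [Fraction(-1, 2), Fraction(0), Fraction(1, 5)],
     [Fraction(3, 10), Fraction(-1, 5), Fraction(0)]]

print(expected_estimate(P) == P)
```

The factor $K^2$ is the inverse of the probability of selecting a given pair, which is the usual importance weight needed to cancel the sampling probability.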

In Appendix D, we extend our setting to an arbitrary unbiased estimator $\hat{P}_i$ of $P_i$, and in
particular to an arbitrary exploration strategy that does not change adaptively over time. This extension
is parameterized by upper bounds on the absolute value and the variance of $\hat{P}_i(a, b)$ for all rounds i
and all actions (a, b). For uniform exploration, both upper bounds are $K^2$.

While non-adaptive exploration strategies usually lead to suboptimal statistical performance, 
they are often preferable in practice. This is because in large-scale industrial applications the
existing infrastructure is often insufficient to support a feedback loop that would update the exploration
strategy adaptively over time, and upgrading the infrastructure may be infeasible in the near term. 

Statistical guarantees. With these noisy versions of the empirical preference matrices, we can
estimate the expected outcome in a “meta-duel” between two policies $\pi$ and $\rho$, that is, an entry of
the matrix M defined in Eq. (2). In particular, let

$$\hat{M}(\pi, \rho) = \frac{1}{m} \sum_{i=1}^{m} \hat{P}_i(\pi(x_i), \rho(x_i)). \tag{5}$$

Then the expected value of this quantity is $M(\pi, \rho)$, the corresponding entry of M. Moreover, using
Bernstein’s inequality and the union bound, we can show that, with probability at least $1 - \delta$,

$$\bigl| \hat{M}(\pi, \rho) - M(\pi, \rho) \bigr| \le \epsilon' \quad \text{for all } (\pi, \rho) \in \Pi \times \Pi, \tag{6}$$

where $\epsilon' = O(K \sqrt{\ln(|\Pi| / \delta) / m})$.³ Thus, although huge, the matrix M is well-approximated
by the matrix $\hat{M}$ using only a moderately sized sample. In fact, to find an approximate maxmin
strategy for M it suffices to find one for $\hat{M}$, which will be the approach taken by our algorithms.


Lemma 3 Given the set-up above, suppose that Eq. (6) holds (as will be the case with probability
at least $1 - \delta$), and suppose further that $W \in \Delta_{|\Pi|}$ is a probability vector for which

$$\min_{U \in \Delta_{|\Pi|}} W^\top \hat{M} U \ge \max_{W' \in \Delta_{|\Pi|}} \min_{U \in \Delta_{|\Pi|}} W'^\top \hat{M} U - \epsilon.$$

Then W is a $(2\epsilon' + \epsilon)$-approximate von Neumann winner for M.
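
The full proof is deferred to the appendix. One way to see why a bound of this form is plausible is the following sketch, in which $\hat{M}$ denotes the empirical matrix of Eq. (5), $W^*$ denotes an exact von Neumann winner for M, and all minima and maxima range over probability vectors on the policy space.

```latex
\begin{align*}
\bigl| W^\top (\hat{M} - M)\, U \bigr| &\le \epsilon'
  && \text{for all probability vectors } W, U \text{ (entrywise bound of Eq. (6))}, \\
\max_{W'} \min_{U} W'^\top \hat{M} U
  &\ge \min_{U} W^{*\top} \hat{M} U
   \ge \min_{U} W^{*\top} M U - \epsilon' \ge -\epsilon', \\
\min_{U} W^\top M U
  &\ge \min_{U} W^\top \hat{M} U - \epsilon'
   \ge \max_{W'} \min_{U} W'^\top \hat{M} U - \epsilon - \epsilon'
   \ge -(2\epsilon' + \epsilon).
\end{align*}
```

The second line uses that $W^*$ guarantees a nonnegative margin on M, and the third line combines the hypothesis of the lemma with the first two lines.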


A more compact version of the problem. Our aim now is to find an approximate maxmin strategy
for the matrix $\hat{M}$. Although this matrix is gigantic in both dimensions, by leveraging how it was
constructed from only a small number of empirical observations, we can re-express the problem in a
far more compact form. To this end, let us define, for each policy $\pi \in \Pi$, a policy vector $v_\pi \in \mathbb{R}^{mK}$
that encodes the behavior of $\pi$ on the exploration data. For readability, although a vector, we index
entries of $v_\pi$ by pairs (i, a), where i is a round and a is an action, and we define

$$v_\pi(i, a) = \mathbf{1}\{\pi(x_i) = a\} / \sqrt{m}.$$

Thus, $v_\pi$ is broken into m length-K blocks, with block i encoding in a natural way the action
selected by $\pi$ on $x_i$. (The constant $1/\sqrt{m}$ is for normalization.)

3. The proof for Eq. (6) and the subsequent Lemma 3 are in Appendix D.1.

We also define an mK × mK block-diagonal matrix B, where the m blocks along the diagonal
are exactly the K × K matrices $\hat{P}_i$ described above. Formally, using the earlier indexing,

$$B((i, a), (j, b)) = \mathbf{1}\{i = j\}\, \hat{P}_i(a, b).$$

Working through these definitions, it can be verified that for any two policies $\pi$ and $\rho$, the
quantity $v_\pi^\top B v_\rho$ is exactly equal to $\hat{M}(\pi, \rho)$ as defined in Eq. (5). This means that if W and U are
probability vectors over $\Pi$, then

$$W^\top \hat{M} U = \Bigl( \sum_{\pi \in \Pi} W(\pi)\, v_\pi \Bigr)^{\!\top} B \Bigl( \sum_{\rho \in \Pi} U(\rho)\, v_\rho \Bigr).$$

Therefore, the problem of finding a maxmin strategy for $\hat{M}$ is equivalent to solving

$$\max_{w \in C} \min_{u \in C} w^\top B u \tag{7}$$

where C is the convex hull of the set of all policy vectors $\{v_\pi : \pi \in \Pi\}$ (henceforth, the policy hull).
Furthermore, a solution $w \in C$ is necessarily a convex combination of vectors $v_\pi$, and therefore
corresponds to a probability vector over policies.
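
The identity between the bilinear form on policy vectors and the empirical meta-duel estimate of Eq. (5) can be checked numerically. In the sketch below the contexts, the per-round estimate blocks, and the two policies are hypothetical; the block-diagonal matrix B is never materialised, only its diagonal blocks are stored.

```python
import math


def policy_vector(pi, xs, K):
    """Length m*K vector whose entry (i, a) equals 1{pi(x_i) = a} / sqrt(m)."""
    m = len(xs)
    v = [0.0] * (m * K)
    for i, x in enumerate(xs):
        v[i * K + pi(x)] = 1 / math.sqrt(m)
    return v


def bilinear_form(v, blocks, u, K):
    """v^T B u for block-diagonal B given as a list of its K x K diagonal blocks."""
    total = 0.0
    for i, block in enumerate(blocks):
        for a in range(K):
            for b in range(K):
                total += v[i * K + a] * block[a][b] * u[i * K + b]
    return total


def empirical_meta_entry(pi, rho, xs, blocks):
    """Direct evaluation of the empirical average in Eq. (5)."""
    return sum(B_i[pi(x)][rho(x)] for x, B_i in zip(xs, blocks)) / len(xs)


K = 2
xs = [0, 1, 2]
# One non-zero entry per round, equal to K^2 * r with r in {-1, +1}.
blocks = [[[0, 4], [0, 0]],
          [[0, 0], [4, 0]],
          [[4, 0], [0, 0]]]
pi = lambda x: x % 2     # alternate between the two actions
rho = lambda x: 0        # always play action 0

lhs = bilinear_form(policy_vector(pi, xs, K), blocks, policy_vector(rho, xs, K), K)
rhs = empirical_meta_entry(pi, rho, xs, blocks)
print(lhs, rhs)
```

The two factors of $1/\sqrt{m}$ in the policy vectors multiply to the $1/m$ of the empirical average, which is why the normalization constant is chosen that way.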

The formulation given in Eq. (7) shows that B should itself be viewed as a game matrix, and 
that our remaining goal is to approximately solve this game. This matrix has the advantage of being 
far smaller than $\hat{M}$. However, unlike a conventional matrix game, the space from which the players’
vectors w and u are chosen is not the standard space of probability vectors over actions, but rather 
the convex hull of an exponentially large set of vectors. 

Classification oracle. Our algorithms assume that the policy space $\Pi$ is structured in a way that
admits a certain computational operation that is quite natural in the realm of learning. Specifically,
we assume the existence of a classification oracle. The input to this oracle is a sequence of cost
vectors $c_1, \ldots, c_m$, each in $\mathbb{R}^K$, with the interpretation that $c_i(a)$ is the cost of choosing action a
on context $x_i$. The output of the oracle is the policy in $\Pi$ with minimum cost, that is,

$$\arg\min_{\pi \in \Pi} \sum_{i=1}^{m} c_i(\pi(x_i)). \tag{8}$$

Indeed, regarding the $x_i$’s as examples, the actions a as labels or classes, and the policies $\pi$ as
classifiers, we see that this oracle is in fact solving an empirical, cost-sensitive, multi-class classification
problem. Thus, the assumption of such an oracle is an idealization based on the numerous cases
in which effective classification algorithms already exist. In practice, the policy space $\Pi$ is usually
defined as the space of all possible policies returned by a given classification algorithm, and we
hope that our methods will be effective when using ordinary off-the-shelf classification algorithms
as the oracle.
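
For a policy space small enough to enumerate, the oracle's input and output can be illustrated by brute force. The sketch below is only a stand-in with running time linear in the number of policies; in the intended use the oracle would be a cost-sensitive classification learner. The contexts, costs, and policies are hypothetical.

```python
def brute_force_oracle(policies, xs, costs):
    """Return the policy minimising sum_i costs[i][pi(xs[i])], as in Eq. (8)."""
    return min(policies, key=lambda pi: sum(c[pi(x)] for x, c in zip(xs, costs)))


xs = [0, 1, 2, 3]
# costs[i][a] is the cost of playing action a on context xs[i].
costs = [[0, 1], [1, 0], [0, 1], [1, 0]]
policies = [
    lambda x: 0,        # always action 0, total cost 2
    lambda x: 1,        # always action 1, total cost 2
    lambda x: x % 2,    # follow the parity of the context, total cost 0
]

best = brute_force_oracle(policies, xs, costs)
print([best(x) for x in xs])
```

The returned object is the policy itself, so the caller can evaluate it on new contexts as well as on the exploration data.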

Equivalently, the classification oracle can be described in terms of policy vectors. Specifically, 
the cost vectors above can be identified with their concatenation, a single vector $c \in \mathbb{R}^{mK}$, divided
naturally into m blocks. Then the problem given in Eq. (8) is the same as

$$\arg\min_{w \in C} c \cdot w,$$

where the argmin is over the policy hull C defined above. This is because the minimum, without
loss of generality, will be a policy vector $v_\pi$, where $\pi$ minimizes Eq. (8). Therefore, in what follows,
we use expressions of this latter form to indicate an invocation of our assumed classification oracle.

Algorithms and end-to-end guarantees. We design algorithms that compute an approximate von 
Neumann winner W by solving the optimization problem in Eq. (7). Although there exist many 
methods for solving such a game, the challenge here is the requirement that the solution be in the 
policy hull $\mathcal{C}$. As already seen in Section 5, regret minimization algorithms are a natural 
choice. However, most standard algorithms will not conform to this constraint. In the sections that 
follow, we provide two algorithms: Algorithm SparringFPL, which builds on the 
Follow-the-Perturbed-Leader algorithm of Kalai and Vempala (2003), and Algorithm ProjectedGD, which 
builds on the online projected gradient descent methods of Zinkevich (2003). 

For a given approximation quality, the performance of either algorithm is characterized by several 
quantities: the sufficient number of exploration rounds, the running time, the storage requirement, 
and the number of policies in the support of W. As it turns out, the key quantities are the 
number of exploration rounds and the number of oracle calls. We assume each oracle call returns 
both a policy vector and a corresponding policy, each representable using $b$ bits.⁴ The solution W 
is specified by explicitly listing the probabilities and policies in its support. 

Theorem 4 Consider $K$ actions and a policy class $\Pi$ with $b$-bit representation. Fix parameters 
$\epsilon, \delta > 0$. Both SparringFPL and ProjectedGD compute an $\epsilon$-approximate von Neumann 
winner W with probability $1 - \delta$ using $m = O((K^2/\epsilon^2) \ln(|\Pi|/\delta))$ exploration 
rounds with uniform exploration strategy. The number of oracle calls is 
$N = O((K^6/\epsilon^4) \ln(|\Pi|/\delta))$ for SparringFPL and $N = O(K^8/\epsilon^4)$ for ProjectedGD. 
For both algorithms, disregarding oracle calls, the running time is $O(mKN)$, the storage requirement 
is $O(bN)$, and W is a distribution over at most $N$ policies. 

While SparringFPL is very simple and intuitive, ProjectedGD achieves a better number of 
oracle calls whenever $K^2 \ll \ln(|\Pi|/\delta)$. 

Our algorithms can be used in the full-explore-exploit setting as well: after $m$ exploration rounds 
with uniform exploration strategy, W is computed and used in the remaining rounds for both actions. 
The parameter $m$ is chosen in advance as a function of the time horizon $T$. The statistical 
performance is expressed via regret, as defined in Eq. (3). The total running time is dominated by 
the time to compute W.⁵ 

Theorem 5 (regret) Consider $K$ actions, time horizon $T > K$, and a policy class $\Pi$ with $b$-bit 
representation. Fix a parameter $\delta > 0$. Both SparringFPL and ProjectedGD achieve regret 
$O(K^{2/3} T^{2/3} \Psi^{1/3})$ with probability $1 - \delta$, where $\Psi = \ln(|\Pi|/\delta)$. The 
number of oracle calls is $N = O(K^{10/3} T^{4/3} \Psi^{-1/3})$ for SparringFPL and 
$N = O(K^{16/3} T^{4/3} \Psi^{-4/3})$ for ProjectedGD. For both algorithms, disregarding oracle calls, 
the total running time is $O(mKN)$, the storage requirement is $O(bN)$, and the number of exploration 
rounds is $m = O(K^{2/3} T^{2/3} \Psi^{1/3})$. 
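
The exponents in Theorem 5 can be recovered from Theorem 4 by a heuristic balancing argument. The sketch below ignores constants and assumes that each exploration round costs $O(1)$ regret and each exploitation round costs $O(\epsilon)$; this decomposition is an assumption made for illustration, not the proof itself.

```latex
% Theorem 4 with m exploration rounds gives accuracy
%   \epsilon \asymp K \sqrt{\Psi / m},  where \Psi = \ln(|\Pi|/\delta).
\begin{align*}
\text{regret} &\lesssim m + \epsilon T \asymp m + T K \sqrt{\Psi/m}, \\
m \asymp T K \sqrt{\Psi/m}
  &\;\Longrightarrow\; m \asymp K^{2/3} T^{2/3} \Psi^{1/3}, \qquad
  \epsilon \asymp K^{2/3} \Psi^{1/3} T^{-1/3}, \\
\text{SparringFPL:}\quad N &\asymp \frac{K^6 \Psi}{\epsilon^4}
  = K^{10/3} T^{4/3} \Psi^{-1/3}, \\
\text{ProjectedGD:}\quad N &\asymp \frac{K^8}{\epsilon^4}
  = K^{16/3} T^{4/3} \Psi^{-4/3}.
\end{align*}
```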

4. Often, policies are specified as a parameter vector to some algorithm that implements them. For finite classes, it is 
usually the case that $b$ is roughly $O(\ln |\Pi|)$. 

5. The running time to execute W in each of the exploitation rounds (i.e., to compute the random action for a given 
context) is a low-order term; we omit further details from this version. 

7. Solving the compact game with SparringFPL 

Our first algorithm to solve Eq. (7), SparringFPL, is based on the Follow-the-Perturbed-Leader 
(FPL) algorithm of Kalai and Vempala (2003). FPL is designed for a standard online learning 
problem: Let $\mathcal{D}$ and $\mathcal{L}$ be subsets of $\mathbb{R}^{mK}$. On each round 
$t = 1, \ldots, N$, the learner chooses a decision vector $\mathbf{d}_t \in \mathcal{D}$, and then 
receives a loss vector $\boldsymbol{\ell}_t \in \mathcal{L}$. The learner’s goal is to minimize 
its cumulative loss $\sum_{t=1}^{N} \mathbf{d}_t \cdot \boldsymbol{\ell}_t$ relative to the best possible 
loss using a fixed decision, that is, 
$\min_{\mathbf{d} \in \mathcal{D}} \sum_{t=1}^{N} \mathbf{d} \cdot \boldsymbol{\ell}_t$. FPL chooses 
$\mathbf{d}_t$ as the best such vector based on a slightly perturbed version of the preceding losses. 
Namely, letting $\mathbf{p}_t \in \mathbb{R}^{mK}$ be chosen uniformly at random from 
$[0, 1/\sigma]^{mK}$, 

\[ \mathbf{d}_t = \arg\min_{\mathbf{d} \in \mathcal{D}} \; \mathbf{d} \cdot \Big( \sum_{\tau=1}^{t-1} \boldsymbol{\ell}_\tau + \mathbf{p}_t \Big). \]

We solve Eq. (7) by sparring two copies of FPL, called row-FPL and column-FPL, in the fashion 
of a repeated game. On every round $t$, row-FPL uses FPL to select a vector $\mathbf{w}_t$, while 
column-FPL uses a different copy of FPL to select a vector $\mathbf{u}_t$. We then define the resulting 
loss vectors to be $-\mathbf{B}\mathbf{u}_t$ for row-FPL, and $\mathbf{B}^\top \mathbf{w}_t$ for 
column-FPL. Here is the complete algorithm: 

• For $t = 1, \ldots, N$: 

- Choose uniform random perturbations $\mathbf{p}_t$, $\mathbf{q}_t$ from $[0, 1/\sigma]^{mK}$. 

- Let $\mathbf{w}_t = \arg\min_{\mathbf{w} \in \mathcal{C}} \mathbf{w} \cdot [-\mathbf{B}(\mathbf{u}_1 + \cdots + \mathbf{u}_{t-1}) + \mathbf{p}_t]$. 

- Let $\mathbf{u}_t = \arg\min_{\mathbf{u} \in \mathcal{C}} \mathbf{u} \cdot [\mathbf{B}^\top(\mathbf{w}_1 + \cdots + \mathbf{w}_{t-1}) + \mathbf{q}_t]$. 

• Output $\bar{\mathbf{w}} = \frac{1}{N} \sum_{t=1}^{N} \mathbf{w}_t$ 

The argmin expressions in the algorithm are implemented using the classification oracle. The 
returned vector $\bar{\mathbf{w}}$ is in $\mathcal{C}$, and in fact corresponds to a uniform mixture of 
$N$ policies. 

In Appendix D.2, we show that to find an $\epsilon$-approximate solution to Eq. (7) with probability 
$1 - \delta$, it suffices to use $N = O(K^4/\epsilon^2)(m + \ln(1/\delta))$ steps of the algorithm with 
$\sigma = \sqrt{2/(K^4 N)}$, which in turn implies Theorems 4 and 5 for SparringFPL. 
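
The sparring loop can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the policy hull is represented by an explicit list of vertices, the oracle is a brute-force linear minimization over that list, and the toy game is rock-paper-scissors over the probability simplex (a single context, so $m = 1$). The function names and the toy constants D, R, A are assumptions of this example; $\sigma$ follows the generic tuning $\sqrt{2D/(RAN)}$ used for FPL in Appendix D.2.

```python
import random
from math import log, sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def oracle(vertices, c):
    """Brute-force linear minimization of c . v over the hull's vertices."""
    return min(vertices, key=lambda v: dot(c, v))


def sparring_fpl(B, vertices, num_rounds, sigma, rng):
    n = len(B)
    Bt = [list(col) for col in zip(*B)]   # transpose of B
    sum_Bu = [0.0] * n                    # B (u_1 + ... + u_{t-1})
    sum_Btw = [0.0] * n                   # B^T (w_1 + ... + w_{t-1})
    w_bar = [0.0] * n
    for _ in range(num_rounds):
        p = [rng.uniform(0.0, 1.0 / sigma) for _ in range(n)]
        q = [rng.uniform(0.0, 1.0 / sigma) for _ in range(n)]
        # row-FPL sees losses -B u, column-FPL sees losses B^T w
        w = oracle(vertices, [-s + pi for s, pi in zip(sum_Bu, p)])
        u = oracle(vertices, [s + qi for s, qi in zip(sum_Btw, q)])
        sum_Bu = [a + dot(row, u) for a, row in zip(sum_Bu, B)]
        sum_Btw = [a + dot(row, w) for a, row in zip(sum_Btw, Bt)]
        w_bar = [a + b / num_rounds for a, b in zip(w_bar, w)]
    return w_bar


# Toy game (hypothetical): rock-paper-scissors, hull = probability simplex.
B = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
vertices = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
N = 5000
D, R, A = 1.0, 1.0, 2.0                   # bounds on ||d||_1, |d . l|, ||l||_1 for this toy game
sigma = sqrt(2 * D / (R * A * N))
w_bar = sparring_fpl(B, vertices, N, sigma, random.Random(0))

# Worst-case payoff of w_bar against any pure column strategy.
worst = min(sum(w_bar[i] * B[i][j] for i in range(3)) for j in range(3))
```

Because the game is skew-symmetric its value is zero, so `worst` close to zero indicates that `w_bar` is close to a maxmin strategy.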

8. Solving the compact game with ProjectedGD 

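Our second algorithm, called ProjectedGD, solves Eq. (7) using online projected gradient descent 
methods as studied by Zinkevich (2003). The algorithm maintains a vector 
$\mathbf{w}_t \in \mathcal{C}$ corresponding to a strategy for the row player. On every round, a column 
strategy $\mathbf{u}_t \in \mathcal{C}$ is chosen that is a “best response” to $\mathbf{w}_t$. The 
strategy $\mathbf{w}_t$ is updated by taking a small gradient step. The resulting vector 
$\mathbf{z}_{t+1}$ is likely to be outside the set $\mathcal{C}$, and therefore is (approximately) 
projected back to $\mathcal{C}$, yielding $\mathbf{w}_{t+1}$. The algorithm is as follows: 

• Choose any $\mathbf{w}_1 \in \mathcal{C}$ 

• For $t = 1, \ldots, N_{\text{out}}$: 

- $\mathbf{u}_t = \arg\min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u}$ 

- $\mathbf{z}_{t+1} = \mathbf{w}_t + \eta \mathbf{B} \mathbf{u}_t$ 

- $\mathbf{w}_{t+1} = \text{ApproxProject}(\mathbf{z}_{t+1}, \mathbf{w}_t)$ 

• Output $\bar{\mathbf{w}} = \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}_t$ 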

Ideally, we would like for $\mathbf{w}_{t+1}$ to be the exact Euclidean projection of 
$\mathbf{z}_{t+1}$ onto $\mathcal{C}$, but instead need to settle for an approximation. For this 
purpose, the procedure ApproxProject($\mathbf{z}$, $\mathbf{v}_1$), described below, computes an 
approximate projection of an arbitrary vector $\mathbf{z}$ onto $\mathcal{C}$. It takes as an 
input a second vector $\mathbf{v}_1$ that is already in $\mathcal{C}$, and which we can think of as an 
initial guess at the actual projection. The quality (as an approximation) of the returned vector 
$\bar{\mathbf{v}}$ is allowed to depend on how close $\mathbf{v}_1$ is to $\mathbf{z}$. Specifically, 
we require that, for all $\mathbf{s} \in \mathcal{C}$, and a constant $\alpha$ specified later, 

\[ \|\mathbf{s} - \bar{\mathbf{v}}\|^2 \le \|\mathbf{s} - \mathbf{z}\|^2 + \alpha \cdot \|\mathbf{v}_1 - \mathbf{z}\|. \tag{9} \]

In Appendix D.3, we show that with the parameter $\eta = 2/(L\sqrt{N_{\text{out}}})$ our algorithm finds 
an $\epsilon$-approximate solution to Eq. (7), where $\epsilon = 2L/\sqrt{N_{\text{out}}} + L\alpha/2$. 

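Computing approximate projections. It remains to describe the approximate-projection procedure 
ApproxProject($\mathbf{z}$, $\mathbf{v}_1$). Given an arbitrary vector $\mathbf{z}$ and another vector 
$\mathbf{v}_1 \in \mathcal{C}$, the goal of the algorithm, as in Eq. (9), can be restated as that of 
finding a vector $\bar{\mathbf{v}} \in \mathcal{C}$ for which 

\[ \min_{\mathbf{s} \in \mathcal{C}} F(\mathbf{s}, \bar{\mathbf{v}}) \ge -\alpha \cdot \|\mathbf{v}_1 - \mathbf{z}\| \tag{10} \]

where we define 
$F(\mathbf{s}, \mathbf{v}) = \|\mathbf{s} - \mathbf{z}\|^2 - \|\mathbf{s} - \mathbf{v}\|^2 = 2\mathbf{s} \cdot (\mathbf{v} - \mathbf{z}) + \|\mathbf{z}\|^2 - \|\mathbf{v}\|^2$. 
Note that $F$ is linear in $\mathbf{s}$ (for each $\mathbf{v}$), and concave in $\mathbf{v}$ (for each 
$\mathbf{s}$). To ensure that Eq. (10) holds, we give an algorithm that aims to maximize the left-hand 
side of this inequality. (As a side note, the maximizing vector turns out to be exactly the projection 
of $\mathbf{z}$ onto $\mathcal{C}$, although we do not require that fact for our algorithm and 
analysis.) 

To this end, we use an algorithm that resembles repeated play of a game in which the payoff 
is defined by $F$. The $\mathbf{s}$ player uses best response on each round, while the $\mathbf{v}$ 
player again uses a variant of online gradient ascent applied to the function 
$F(\mathbf{s}_t, \cdot)$. The algorithm takes a parameter $\nu \in (0, 1]$, and uses 
$\mathbf{v}_1 \in \mathcal{C}$, which was provided as an argument to 
ApproxProject($\mathbf{z}$, $\mathbf{v}_1$), as the initial vector. Here is the algorithm: 

• For $t = 1, \ldots, N_{\text{in}}$: 

- $\mathbf{s}_t = \arg\min_{\mathbf{s} \in \mathcal{C}} \mathbf{s} \cdot (\mathbf{v}_t - \mathbf{z})$ 

- $\mathbf{v}_{t+1} = (1 - \nu)\mathbf{v}_t + \nu \mathbf{s}_t$ 

• Output $\bar{\mathbf{v}} = \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} \mathbf{v}_t$ 

Note that $\mathbf{v}_t$ is in $\mathcal{C}$ for every $t$ (by convexity of $\mathcal{C}$), and 
therefore $\bar{\mathbf{v}}$ is as well. 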
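
The approximate-projection loop can be sketched as follows. This is an illustrative version under stated assumptions: the hull is given by an explicit vertex list (here the probability simplex), the oracle is a brute-force linear minimization, and the step size is $\nu = \|\mathbf{z} - \mathbf{v}_1\|/\sqrt{N_{\text{in}}}$ as analyzed in Appendix D.4, clipped to 1 as a safeguard that is not part of the original description. The function names and the test point are hypothetical.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approx_project(z, v1, vertices, num_rounds):
    """Approximately project z onto the convex hull of `vertices`, starting from v1."""
    nu = min(1.0, sqrt(sq_dist(z, v1)) / sqrt(num_rounds))  # step size (clipped)
    v = list(v1)
    v_bar = [0.0] * len(z)
    for _ in range(num_rounds):
        v_bar = [a + b / num_rounds for a, b in zip(v_bar, v)]   # running average of v_t
        direction = [vi - zi for vi, zi in zip(v, z)]
        s = min(vertices, key=lambda x: dot(x, direction))       # one oracle call
        v = [(1 - nu) * vi + nu * si for vi, si in zip(v, s)]    # convex step towards s_t
    return v_bar


def F(s, v, z):
    """F(s, v) = ||s - z||^2 - ||s - v||^2."""
    return sq_dist(s, z) - sq_dist(s, v)


# Toy usage (hypothetical numbers): project a point lying outside the simplex.
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z = [0.6, 0.5, 0.1]
v1 = [1 / 3, 1 / 3, 1 / 3]
N_in = 400
v_bar = approx_project(z, v1, vertices, N_in)

# Eq. (10) with alpha = 8 / sqrt(N_in); F is linear in s, so checking vertices suffices.
alpha = 8 / sqrt(N_in)
lhs = min(F(s, v_bar, z) for s in vertices)
rhs = -alpha * sqrt(sq_dist(v1, z))
```

The update resembles a Frank-Wolfe step with a fixed step size; each round costs exactly one oracle call.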

In Appendix D.4, we show that ApproxProject($\mathbf{z}$, $\mathbf{v}_1$) with parameter 
$\nu = \|\mathbf{z} - \mathbf{v}_1\|/\sqrt{N_{\text{in}}}$ computes $\bar{\mathbf{v}}$ that satisfies 
Eq. (10) with $\alpha = 8/\sqrt{N_{\text{in}}}$. We optimize the choice of $N_{\text{in}}$ and 
$N_{\text{out}}$ to show that one can obtain an $\epsilon$-approximate solution to Eq. (7) using only 
$O(K^8/\epsilon^4)$ oracle calls. This in turn implies Theorems 4 and 5 for ProjectedGD. 
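
Putting the outer and inner loops together gives the following sketch of ProjectedGD. As before, this is only an illustration under assumptions: the hull is an explicit vertex list, the oracle is brute force, the toy game is rock-paper-scissors on the simplex, and L is taken to be the largest Euclidean norm of $\mathbf{B}\mathbf{u}$ over vertices $\mathbf{u}$ (which is what the outer-loop analysis uses). All function names are hypothetical.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def approx_project(z, v1, vertices, n_in):
    """Inner loop: approximate projection of z onto the hull of `vertices`."""
    gap = sqrt(sum((zi - vi) ** 2 for zi, vi in zip(z, v1)))
    nu = min(1.0, gap / sqrt(n_in))
    v, v_bar = list(v1), [0.0] * len(z)
    for _ in range(n_in):
        v_bar = [a + b / n_in for a, b in zip(v_bar, v)]
        direction = [vi - zi for vi, zi in zip(v, z)]
        s = min(vertices, key=lambda x: dot(x, direction))
        v = [(1 - nu) * vi + nu * si for vi, si in zip(v, s)]
    return v_bar


def projected_gd(B, vertices, n_out, n_in, L):
    """Outer loop: gradient step against a best-responding column player."""
    eta = 2.0 / (L * sqrt(n_out))
    Bt = [list(col) for col in zip(*B)]
    w = list(vertices[0])                  # any starting point in the hull
    w_bar = [0.0] * len(w)
    for _ in range(n_out):
        w_bar = [a + b / n_out for a, b in zip(w_bar, w)]
        wB = [dot(col, w) for col in Bt]                 # the row vector w^T B
        u = min(vertices, key=lambda x: dot(wB, x))      # best response u_t
        Bu = [dot(row, u) for row in B]
        z = [wi + eta * g for wi, g in zip(w, Bu)]       # gradient step
        w = approx_project(z, w, vertices, n_in)         # approximate projection
    return w_bar


# Toy game (hypothetical): rock-paper-scissors, value 0 by skew-symmetry.
B = [[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
L = sqrt(2.0)                              # max ||B u||_2 over vertices u
n_out = n_in = 300
w_bar = projected_gd(B, vertices, n_out, n_in, L)

worst = min(sum(w_bar[i] * B[i][j] for i in range(3)) for j in range(3))
eps = 2 * L / sqrt(n_out) + L * (8 / sqrt(n_in)) / 2
```

Here `worst` is the payoff of `w_bar` against the best pure response; the analysis in Appendix D guarantees that it is no smaller than the game value minus `eps`.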

Appendix A. Failure of the Condorcet winner to exist 

Here, we investigate the reliability of the Condorcet assumption by replicating the experiment of 
Zoghi et al. (2014a, Section 6.1) with a small modification. As in their setting, we consider a 
family of K-armed dueling bandit problems arising from the ranker evaluation problem in IR, where 
the comparisons are carried out using Probabilistic Interleave (Hofmann et al., 2011) and the 
preferences are generated using click models simulating user behavior (Guo et al., 2009a,b). The 
rankers are sampled randomly from the set of 136 rankers provided with the MSLR dataset.⁶ However, 
unlike the experiments of Zoghi et al. (2014a), we use an informational click model, rather than a 
perfect one (Hofmann et al., 2013). The former simulates the behavior of a user who is seeking 
general information about a broad topic, while the latter represents an idealized user, who 
meticulously examines every document in the retrieved list. We believe that the informational click 
model is more realistic and therefore use it here. 

The plot in Figure 1 shows the probability with which the encountered dueling bandit problems 
contain Condorcet winners. As this figure demonstrates, in this setting, the occurrence of the 
Condorcet winner drops rapidly as the number of rankers grows. 

This shows that even in this simple non-contextual example the assumption that there exists a 
Condorcet winner is too unreliable to be practical. In the contextual dueling bandit problem, where 
one is dealing with a potentially very large and diverse set of policies, it is even less realistic 
to expect one policy to dominate every single other policy. 
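
Checking whether a given preference matrix has a Condorcet winner is a simple scan. The sketch below assumes the skew-symmetric convention used in Appendix B, where a positive entry $P(i, j)$ means that arm $i$ beats arm $j$; the function name and the two toy matrices are hypothetical.

```python
def condorcet_winner(P):
    """Return the index of an arm that beats every other arm, or None if there is none."""
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0 for j in range(n) if j != i):
            return i
    return None


# Arm 0 beats both other arms, so it is the Condorcet winner.
transitive = [[0.0, 0.2, 0.4],
              [-0.2, 0.0, 0.1],
              [-0.4, -0.1, 0.0]]

# A cycle (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
cyclic = [[0.0, 1.0, -1.0],
          [-1.0, 0.0, 1.0],
          [1.0, -1.0, 0.0]]

winner_transitive = condorcet_winner(transitive)
winner_cyclic = condorcet_winner(cyclic)
```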

Appendix B. Comparison between the Copeland and von Neumann winners 

A Copeland winner is defined to be any arm that beats the largest number of other arms. It is a 
generalization of the Condorcet winner in the sense that if the Condorcet winner exists, it will be a 
Copeland winner. However, we claim that the von Neumann winner is a more natural generalization 
than the Copeland winner for the following two reasons: first, in the absence of a Condorcet winner, 
Copeland winners, both individually and as a collective, can lose to an arm that is not a Copeland 
winner, whereas the von Neumann winner beats or ties with every single arm; second, the set of 
Copeland winners can be altered by the introduction of “clones,” i.e., arms whose corresponding 
rows of the preference matrix are identical to each other. 

6. http://research.microsoft.com/en-us/projects/mslr/default.aspx 

Figure 1: The probability that the Condorcet assumption holds for subsets of the feature rankers in 
the MSLR dataset. The probability is shown as a function of the size of the subset of 
rankers under consideration. 

Table 1: Copeland scores of the top arms in the duplicated example 

Copeland score variant (for a_i)    |{j : P(i,j) > 0}|    |{j : P(i,j) ≥ 0}|    |{j : P(i,j) > 0}| − |{j : P(i,j) < 0}|
i = 0, 1                            K + 1                 K + 3                 K
i = 2                               K + 1                 K + 2                 K − 1
i = 3                               K + 2                 K + 3                 K + 1

To demonstrate this lack of stability of Copeland winners, consider any (K + 3)-armed example 
with K > 4, where arms $a_1$, $a_2$ and $a_3$ beat all other arms and the three of them are in a cycle, 
with $a_1$ beating $a_2$, $a_2$ beating $a_3$ and $a_3$ beating $a_1$, all with probability 1. It is easy 
to see that these three arms are the only Copeland winners with Copeland score equal to K + 1 and also 
form the support of the von Neumann distribution: indeed, the von Neumann distribution is simply the 
uniform distribution on these three arms. Now, let us consider a slight modification of this problem, 
where we add one more arm, called $a_0$, which is a duplicate of arm $a_1$; hence $P(0, 1) = 0$ and 
$P(0, j) = P(1, j)$ for all $j > 1$. In the following we explain what happens to the set of Copeland 
winners after this modification. In the presence of ties, there are three sensible definitions that 
one could use for the Copeland score; these definitions and the corresponding scores for the top 
four arms in our modified example can be found in Table 1. As the quantities in this table show, 
regardless of the definition of the Copeland score used, the set of Copeland winners for our new 
(K + 4)-armed dueling bandit problem does not contain all of $a_0, \ldots, a_3$. Indeed, under no 
definition can arm $a_2$ be considered a Copeland winner. 

On the other hand, arms $a_0, a_1, a_2, a_3$ still form the support of the von Neumann distribution 
of this modified dueling bandit problem: if we assign weights $w(0), w(1), w(2), w(3)$ to these four 
arms such that 

\[ w(0) + w(1) = w(2) = w(3) = \frac{1}{3} \]

and sample an arm $a_i$ according to these weights, we will (on average) beat any $a_j$ with $j > 3$ and 
tie with all $a_j$ with $j \le 3$. 

We consider the lack of stability under cloning illustrated by this example to be a major 
drawback of the Copeland score as a measure of quality. 
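
The cloning example can be checked numerically. The sketch below builds the modified (K + 4)-armed preference matrix for K = 4, computes the three Copeland score variants, and verifies that the weights 1/6, 1/6, 1/3, 1/3 on $a_0, \ldots, a_3$ beat or tie every arm. The preferences among the K remaining arms are not specified in the text and are set to ties here as an assumption; they do not affect the scores of the top four arms. Function names are hypothetical.

```python
def cloned_example(K):
    """Preference matrix with arms 0..3 = a_0..a_3 (a_0 is a clone of a_1) and K other arms."""
    n = K + 4
    P = [[0.0] * n for _ in range(n)]

    def beats(i, j):
        P[i][j], P[j][i] = 1.0, -1.0

    for top in range(4):
        for other in range(4, n):
            beats(top, other)          # the top arms beat all remaining arms
    for clone in (0, 1):
        beats(clone, 2)                # a_1 (and its clone a_0) beats a_2
        beats(3, clone)                # a_3 beats a_1 (and its clone a_0)
    beats(2, 3)                        # a_2 beats a_3
    return P


def copeland_scores(P):
    """The three score variants: wins, wins-or-ties (self included), wins minus losses."""
    wins = [sum(1 for x in row if x > 0) for row in P]
    non_losses = [sum(1 for x in row if x >= 0) for row in P]
    net = [sum(1 for x in row if x > 0) - sum(1 for x in row if x < 0) for row in P]
    return wins, non_losses, net


K = 4
P = cloned_example(K)
wins, non_losses, net = copeland_scores(P)

# Expected payoff of the mixed strategy against each arm (non-negative means beat or tie).
w = [1 / 6, 1 / 6, 1 / 3, 1 / 3] + [0.0] * K
payoffs = [sum(w[i] * P[i][j] for i in range(K + 4)) for j in range(K + 4)]
```

For K = 4 the scores of arms $a_0, \ldots, a_3$ agree with Table 1, and arm $a_2$ never attains the maximum under any of the three variants.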

Figure 2: The percentage of preference matrices of a given size (horizontal axis: number of arms), 
sampled from the MSLR dataset, for which all Copeland winners are contained in the support of the 
von Neumann winner. 

Furthermore, as the following 5-armed preference matrix illustrates, the von Neumann winner 
does not necessarily contain the Copeland winner in its support: 

\[
P = \begin{bmatrix}
 0    &  0.5 & -0.5 &  0.5 & -0.95 \\
-0.5  &  0   &  0.5 & -0.2 &  0.5  \\
 0.5  & -0.5 &  0   & -0.2 &  0.5  \\
-0.5  &  0.2 &  0.2 &  0   &  0.5  \\
 0.95 & -0.5 & -0.5 & -0.5 &  0
\end{bmatrix}
\]

Figure 3: The fractions of preference matrices of a given size (horizontal axis) with a given 
maximum Copeland loss among the arms in the support of their von Neumann winner (vertical 
axis): given K and M, the area of the circle at coordinate (K, M) of this plot is proportional 
to the percentage of K-armed sub-matrices of the Informational MSLR preference 
matrix, P, for which the maximum Copeland loss of an arm in the support of the von 
Neumann winner of P is equal to M. 

Indeed, the von Neumann winner of this matrix is the uniform distribution on the first three 
arms, as can be easily checked by multiplying the row vector [1/3 1/3 1/3 0 0] with P, while the 
Copeland winner is the fourth arm, since the Copeland scores of the 5 arms are [2 2 2 3 1]. Moreover, 
the fourth arm also happens to be both the Borda winner (Urvoy et al., 2013) and the Random Walk 
winner (Negahban et al., 2012; Busa-Fekete and Hüllermeier, 2014). The Borda winner is the arm 
with the highest chance of winning a comparison against a uniformly randomly chosen opponent, 
i.e., the arm corresponding to the row in the preference matrix whose entries have the highest sum: in 
this case the Borda scores are [0.455 0.53 0.53 0.54 0.445]. The Random Walk winner is obtained 
as follows: first, we convert the “probabilistic” preference matrix (i.e., P/2 + 0.5) into a 
column-wise stochastic matrix (by dividing each column by its sum), then find the stationary 
distribution of the Markov chain defined by this matrix (by finding the right eigenvector of the 
stochastic matrix corresponding to eigenvalue 1), and, finally, declare the arm with the highest 
probability under this stationary distribution to be the Random Walk winner. In this particular 
example, the stationary distribution is [0.198 0.212 0.204 0.217 0.169] and so the Random Walk winner 
is the fourth arm, as mentioned before. 
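
The quantities quoted for the 5-armed example can be reproduced with a short script. This is a sketch under two assumptions: the Borda score is taken as the row average of P/2 + 0.5 over all five columns (which matches the quoted values), and the stationary distribution is found by power iteration rather than by an explicit eigenvector computation. Variable names are hypothetical.

```python
P = [[0.0, 0.5, -0.5, 0.5, -0.95],
     [-0.5, 0.0, 0.5, -0.2, 0.5],
     [0.5, -0.5, 0.0, -0.2, 0.5],
     [-0.5, 0.2, 0.2, 0.0, 0.5],
     [0.95, -0.5, -0.5, -0.5, 0.0]]
n = len(P)

# von Neumann check: the uniform mix on the first three arms beats or ties every arm.
w = [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0]
payoffs = [sum(w[i] * P[i][j] for i in range(n)) for j in range(n)]

# Copeland scores: number of arms beaten.
copeland = [sum(1 for x in row if x > 0) for row in P]

# Borda scores: average winning probability, with probabilities Q = P/2 + 0.5.
Q = [[x / 2 + 0.5 for x in row] for row in P]
borda = [sum(row) / n for row in Q]

# Random Walk winner: stationary distribution of the column-stochastic matrix S.
col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
S = [[Q[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
pi = [1.0 / n] * n
for _ in range(1000):                       # power iteration: pi <- S pi
    pi = [sum(S[i][j] * pi[j] for j in range(n)) for i in range(n)]
random_walk_winner = max(range(n), key=lambda i: pi[i])
```

All entries of S are positive, so the power iteration converges to the unique stationary distribution.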

Despite the above observations, in practice, the Copeland winner and the von Neumann winner 
tend to agree to a large extent. For instance, in preference matrices sampled from the MSLR dataset, 
as described in Appendix A, in over 99.9% of the examples, the support of the von Neumann winner 
contained at least one Copeland winner. Moreover, in the overwhelming majority of the cases, all 
Copeland winners were assigned non-zero probability by the von Neumann winner, although the percentage 
of cases where this phenomenon occurs is slightly lower than the above figure and dependent on the 
number of arms (see Figure 2). Based on these observations, the Copeland winner can roughly be 
thought of as a more restrictive notion than the von Neumann winner. 

Furthermore, as Figure 3 demonstrates, the arms in the support of the von Neumann winner 
tend to have high Copeland scores (or equivalently, low Copeland losses) in practice. Given this 
close relation between these two notions of winners, a natural question becomes whether the recent 
improvements made in solving the Copeland dueling bandit problem (Zoghi et al., 2015a) can be 
used to speed up the task of finding the von Neumann winner. 

Another aspect of the von Neumann winner that might be disconcerting when first encountered 
is the fact that it is a distribution, which in theory can put non-zero probability on all arms; however, 
in practice, this is very far from being the case. Indeed, among the more than one million preference 
matrices sampled from the MSLR dataset, not a single one had a von Neumann winner that assigned 
non-zero probability to more than 5 arms. In fact, in the vast majority of the cases, the von Neumann 
winners had supports of size 1 or 3 (see Figure 4). Note that fewer than 0.03% of preference matrices 
had von Neumann winners whose support contains 5 arms. 

Figure 4: The percentages of preference matrices of a given size with von Neumann winners of a 
given support size. The bottom plot is simply a zoomed version of the top one; it was 
included because preference matrices with von Neumann winners of support size 5 were 
very infrequent. 


Appendix C. Analysis of SparringEXP4.P (proof of Theorem 2) 

To fulfill the requirements of the learning model for Exp4.P, we also need to define rewards $r_t(a)$ 
for all of the actions that were not chosen. Furthermore, these rewards need to be defined before 
each copy of the algorithm chooses its action (or, more technically, in a manner that is 
conditionally independent of each copy’s choice). To this end, for every pair of actions $(a, b)$, we 
define a $\{-1, +1\}$-valued random variable $R_t(a, b)$ with the expected value $P_t(a, b)$. Thus, 
$R_t(a, b)$ can be viewed as the outcome of a hypothetical duel between actions $a$ and $b$. These 
values are only used for the mathematical argument, and do not literally have to be computed. Only the 
pair $(a_t, b_t)$ is actually used in a duel. 

For row-Exp, based on column-Exp's chosen action $b_t$, we then define rewards 
$r_t^{(\text{row})}(a) = R_t(a, b_t)$ for all $a$. And similarly, for column-Exp, based on row-Exp's 
chosen action $a_t$, we define rewards $r_t^{(\text{col})}(b) = -R_t(a_t, b)$ for all $b$. In 
particular, this means that row-Exp receives, for its chosen action $a_t$, the reward $R_t(a_t, b_t)$ 
(that is, the result of a duel between $a_t$ and $b_t$), while column-Exp receives the reward 
$-R_t(a_t, b_t)$ for its chosen action $b_t$. 

Let us first take the point of view of row-Exp. Plugging in to Eq. (4), we have that, with 
probability at least $1 - \delta/4$, for all $\pi \in \Pi$, 

\[ \sum_{t=1}^{T} R_t(a_t, b_t) \ge \sum_{t=1}^{T} R_t(\pi(x_t), b_t) - O\big(\sqrt{KT \ln(|\Pi|/\delta)}\big). \]

Further, using Azuma’s lemma and union bound, we can show that, with probability at least 
$1 - \delta/4$, for every $\pi \in \Pi$, 

\[ \sum_{t=1}^{T} R_t(\pi(x_t), b_t) \ge \sum_{t=1}^{T} P_t(\pi(x_t), b_t) - O\big(\sqrt{KT \ln(|\Pi|/\delta)}\big). \]

Similarly, from column-Exp’s perspective, with probability at least $1 - \delta/4$, for all 
$\pi \in \Pi$, 

\[ -\sum_{t=1}^{T} R_t(a_t, b_t) \ge -\sum_{t=1}^{T} R_t(a_t, \pi(x_t)) - O\big(\sqrt{KT \ln(|\Pi|/\delta)}\big) \]

and, by Azuma’s lemma and the skew-symmetry of $P_t$, with probability at least $1 - \delta/4$, for 
every $\pi \in \Pi$, 

\[ -\sum_{t=1}^{T} R_t(a_t, \pi(x_t)) \ge \sum_{t=1}^{T} P_t(\pi(x_t), a_t) - O\big(\sqrt{KT \ln(|\Pi|/\delta)}\big). \]

Combining and rearranging now yields the theorem. 

Appendix D. Analysis of SparringFPL and ProjectedGD 

Compared to the presentation in the body of the paper, we extend our setting from uniform 
exploration strategy to an arbitrary unbiased estimator $\hat{P}_i$ of $P_i$, i.e., to any matrix 
$\hat{P}_i$ which satisfies $\mathbb{E}[\hat{P}_i \mid P_i] = P_i$. Our results are parameterized by two 
numbers, $L$ and $V$, such that $|\hat{P}_i(a, b)| \le L$ and $\mathrm{Var}(\hat{P}_i(a, b)) \le V$ for 
all exploration rounds $i$ and all action pairs $(a, b)$. For uniform exploration, 
both upper bounds are $K^2$. 

D.1. Reduction from $M$ to $\hat{M}$ (proof of Lemma 3) 

In this subsection, we prove Lemma 3, which reduces the optimization problem to that on the 
approximate matrix $\hat{M}$ computed from the data. As a first step, we prove Eq. (6), which relates 
$\hat{M}$ to the true matrix $M$. 

Proof [Eq. (6)] Let $Z_1, \ldots, Z_m$ be independent, identically distributed random variables, each 
taking values in $[-R, R]$, and each having mean zero and variance $V$. Then according to Bernstein’s 
inequality, the probability that the average $A = (1/m) \sum_{i=1}^{m} Z_i$ exceeds some value $s$ is 
at most 

\[ \exp\left( -\frac{m s^2 / 2}{V + R s / 3} \right). \]

For an appropriate choice of $s$, this implies that, with probability at least $1 - \delta$, 

\[ A \le \frac{2R \ln(1/\delta)}{3m} + \sqrt{\frac{2V \ln(1/\delta)}{m}}. \tag{11} \]

To derive Eq. (6), for a fixed pair of policies $\pi$ and $\rho$, we can let 
$Z_i = \hat{P}_i(\pi(x_i), \rho(x_i)) - M(\pi, \rho)$, whose mean is zero and variance is at most $V$; 
further, $|Z_i| \le 1 + L$. Plugging into Eq. (11) implies that Eq. (6) holds with 

\[ \epsilon' = \frac{2(1 + L) \ln(|\Pi|^2/\delta)}{3m} + \sqrt{\frac{2V \ln(|\Pi|^2/\delta)}{m}} \]

with probability at least $1 - \delta/|\Pi|^2$. By the union bound, with probability at least 
$1 - \delta$, this will hold simultaneously for all policies $\pi$ and $\rho$. ■ 
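
To get a feel for the size of the deviation term, the bound of Eq. (11) and the resulting $\epsilon'$ can be evaluated numerically. The sketch below uses the standard form of Bernstein's tail bound for the average of bounded variables; the function names and the sample parameter values are assumptions made for illustration.

```python
from math import exp, log, sqrt


def bernstein_tail(s, R, V, m):
    """Bernstein bound on Pr[average > s] for m i.i.d. mean-zero variables in [-R, R] with variance V."""
    return exp(-(m * s * s / 2) / (V + R * s / 3))


def bernstein_deviation(R, V, m, delta):
    """Right-hand side of Eq. (11)."""
    return 2 * R * log(1 / delta) / (3 * m) + sqrt(2 * V * log(1 / delta) / m)


def epsilon_prime(L, V, m, num_policies, delta):
    """Deviation after a union bound over all pairs of policies, with |Z_i| <= 1 + L."""
    return bernstein_deviation(1 + L, V, m, delta / num_policies ** 2)


# Hypothetical numbers: K = 2 actions under uniform exploration, so L = V = K**2 = 4.
L, V, delta, num_policies = 4.0, 4.0, 0.05, 1000
eps_small_m = epsilon_prime(L, V, 10 ** 4, num_policies, delta)
eps_large_m = epsilon_prime(L, V, 10 ** 6, num_policies, delta)

# Sanity check of Eq. (11): at the stated deviation the tail bound is at most delta.
s = bernstein_deviation(1 + L, V, 10 ** 4, delta)
tail_at_s = bernstein_tail(s, 1 + L, V, 10 ** 4)
```

The deviation shrinks roughly as $1/\sqrt{m}$ and depends only logarithmically on the number of policies.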


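Proof [Lemma 3] Eq. (6) implies that $W^\top \hat{M} U$ is within $\epsilon'$ of $W^\top M U$, for all 
probability vectors $W$ and $U$. Therefore, 

\begin{align*}
\min_{U \in \Delta_{|\Pi|}} W^\top M U
  &\ge \min_{U \in \Delta_{|\Pi|}} W^\top \hat{M} U - \epsilon' \\
  &\ge \max_{W \in \Delta_{|\Pi|}} \min_{U \in \Delta_{|\Pi|}} W^\top \hat{M} U - \epsilon' - \epsilon \\
  &\ge \max_{W \in \Delta_{|\Pi|}} \min_{U \in \Delta_{|\Pi|}} W^\top M U - 2\epsilon' - \epsilon \\
  &= -(2\epsilon' + \epsilon).
\end{align*}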



D.2. Analysis of SparringFPL 

To analyze SparringFPL, we build on the provable guarantees for FPL. 

For convenience, let us recap the learning setting for FPL. Let $\mathcal{D}$ and $\mathcal{L}$ be 
subsets of $\mathbb{R}^{mK}$. On each round $t = 1, \ldots, N$, the learner chooses a decision vector 
$\mathbf{d}_t \in \mathcal{D}$, and then receives a loss vector $\boldsymbol{\ell}_t \in \mathcal{L}$. 
The learner’s goal is to minimize its cumulative loss 
$\sum_{t=1}^{N} \mathbf{d}_t \cdot \boldsymbol{\ell}_t$ relative to the best possible loss using a fixed 
decision, that is, $\min_{\mathbf{d} \in \mathcal{D}} \sum_{t=1}^{N} \mathbf{d} \cdot \boldsymbol{\ell}_t$. 

Kalai and Vempala (2003, Theorem 1.1) prove the following (slightly simplified) result: assume 
that $D$, $R$ and $A$ are such that for all $\mathbf{d} \in \mathcal{D}$ and 
$\boldsymbol{\ell} \in \mathcal{L}$ we have that $\|\mathbf{d}\|_1 \le D$, 
$|\mathbf{d} \cdot \boldsymbol{\ell}| \le R$ and $\|\boldsymbol{\ell}\|_1 \le A$. Also, let 
$\sigma = \sqrt{2D/(RAN)}$. Then, for any sequence 
$\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_N \in \mathcal{L}$, 

\[ \mathbb{E}\left[ \sum_{t=1}^{N} \mathbf{d}_t \cdot \boldsymbol{\ell}_t \right] \le \min_{\mathbf{d} \in \mathcal{D}} \sum_{t=1}^{N} \mathbf{d} \cdot \boldsymbol{\ell}_t + 2\sqrt{2DRAN}, \tag{12} \]

where the expectation is over the random choice of perturbations. Kalai and Vempala prove this in 
the oblivious case when the adversary has fixed the $\boldsymbol{\ell}_t$’s ahead of time. However, 
this restriction can be relaxed to allow each $\boldsymbol{\ell}_t$ to be selected adaptively in a 
possibly stochastic fashion that may depend on the entire preceding history through round $t - 1$, but 
not on the perturbation $\mathbf{p}_t$ for the current round. Using a martingale argument and Azuma’s 
lemma (see also Cesa-Bianchi and Lugosi, 2006, Lemma 4.1), it can then be shown that, with probability 
at least $1 - \delta'$, 

\[ \frac{1}{N} \sum_{t=1}^{N} \mathbf{d}_t \cdot \boldsymbol{\ell}_t \le \min_{\mathbf{d} \in \mathcal{D}} \frac{1}{N} \sum_{t=1}^{N} \mathbf{d} \cdot \boldsymbol{\ell}_t + 2\sqrt{2DRA/N} + 2R\sqrt{2\ln(1/\delta')/N}. \tag{13} \]

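Theorem 6 In SparringFPL, set parameter $\sigma = \sqrt{2/(L^2 N)}$. Then with probability at least 
$1 - \delta$, the vector $\bar{\mathbf{w}}$ returned by the algorithm satisfies 

\[ \min_{\mathbf{u} \in \mathcal{C}} \bar{\mathbf{w}}^\top \mathbf{B} \mathbf{u} \ge \max_{\mathbf{w} \in \mathcal{C}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} - 2\epsilon \]

where $\epsilon = 2L\sqrt{2m/N} + 2L\sqrt{2\ln(2/\delta)/N}$. 

Thus, to find an $\epsilon$-approximate solution, we can choose $N$ to be 
$O(L^2/\epsilon^2)(m + \ln(1/\delta))$. This also gives a bound on the number of oracle calls (the 
oracle is called twice per round). 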

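Proof Note that in our case, we can choose $D = \sqrt{m}$, $A = L\sqrt{m}$ and $R = L$. 

Let $\bar{\mathbf{u}} = \frac{1}{N} \sum_{t=1}^{N} \mathbf{u}_t$. Then we have the following chain of 
inequalities holding with probability at least $1 - \delta$: 

\begin{align*}
\min_{\mathbf{u} \in \mathcal{C}} \max_{\mathbf{w} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} - \epsilon
  &\le \max_{\mathbf{w} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \bar{\mathbf{u}} - \epsilon \\
  &\le \frac{1}{N} \sum_{t=1}^{N} \mathbf{w}_t^\top \mathbf{B} \mathbf{u}_t \tag{14} \\
  &\le \min_{\mathbf{u} \in \mathcal{C}} \bar{\mathbf{w}}^\top \mathbf{B} \mathbf{u} + \epsilon \tag{15} \\
  &\le \max_{\mathbf{w} \in \mathcal{C}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} + \epsilon.
\end{align*}

Here, Eqs. (14) and (15) follow directly from Eq. (13) applied, respectively, to row-FPL and 
column-FPL with $\delta' = \delta/2$. Noting that 

\[ \max_{\mathbf{w} \in \mathcal{C}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} = \min_{\mathbf{u} \in \mathcal{C}} \max_{\mathbf{w} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u}, \]

the theorem now follows. 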


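D.3. Analysis of ProjectedGD: outer loop 

Using an analysis similar to Zinkevich (2003), but for a fixed learning rate, and taking into account 
the errors introduced by imperfect projections, we can show the following: 

Lemma 7 For the algorithm ProjectedGD with $\eta = 2/(L\sqrt{N_{\text{out}}})$, we have 

\[ \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u}_t \ge \max_{\mathbf{w} \in \mathcal{C}} \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}^\top \mathbf{B} \mathbf{u}_t - \epsilon \]

where $\epsilon = 2L/\sqrt{N_{\text{out}}} + L\alpha/2$. 

Proof For all $\mathbf{w} \in \mathcal{C}$, we have 

\begin{align*}
\|\mathbf{w} - \mathbf{w}_{t+1}\|^2 - \|\mathbf{w} - \mathbf{w}_t\|^2
  &\le \|\mathbf{w} - \mathbf{z}_{t+1}\|^2 - \|\mathbf{w} - \mathbf{w}_t\|^2 + \alpha \eta L \tag{16} \\
  &= -2\eta (\mathbf{w} - \mathbf{w}_t)^\top \mathbf{B} \mathbf{u}_t + \eta^2 \|\mathbf{B} \mathbf{u}_t\|^2 + \alpha \eta L \tag{17} \\
  &\le -2\eta (\mathbf{w} - \mathbf{w}_t)^\top \mathbf{B} \mathbf{u}_t + \eta^2 L^2 + \alpha \eta L.
\end{align*}

Here, Eq. (16) uses Eq. (9), applied to our case where we have 
$\mathbf{z}_{t+1} - \mathbf{w}_t = \eta \mathbf{B} \mathbf{u}_t$. Eq. (17) follows from straightforward 
algebra. Since $\|\mathbf{w} - \mathbf{w}_1\| \le 2$, summing over $t = 1, \ldots, N_{\text{out}}$ 
yields, for all $\mathbf{w} \in \mathcal{C}$, 

\begin{align*}
-4 &\le \|\mathbf{w} - \mathbf{w}_{N_{\text{out}}+1}\|^2 - \|\mathbf{w} - \mathbf{w}_1\|^2 \\
   &\le -2\eta \sum_{t=1}^{N_{\text{out}}} \mathbf{w}^\top \mathbf{B} \mathbf{u}_t + 2\eta \sum_{t=1}^{N_{\text{out}}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u}_t + \eta^2 L^2 N_{\text{out}} + \alpha \eta L N_{\text{out}}.
\end{align*}

Re-arranging completes the lemma. ■ 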


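We can prove that the returned vector $\bar{\mathbf{w}}$ is an $\epsilon$-approximate maxmin solution 
using a technique similar to Freund and Schapire (1999). (Alternatively, we could use the average of 
the $\mathbf{u}_t$’s, which is an $\epsilon$-approximate minmax solution by the same proof.) 

Theorem 8 The vector $\bar{\mathbf{w}}$ satisfies 
$\min_{\mathbf{u} \in \mathcal{C}} \bar{\mathbf{w}}^\top \mathbf{B} \mathbf{u} \ge \max_{\mathbf{w} \in \mathcal{C}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} - \epsilon$, 
where $\epsilon$ is as in Lemma 7. 

Proof Let $\bar{\mathbf{u}} = \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{u}_t$. Then 

\begin{align*}
\max_{\mathbf{w} \in \mathcal{C}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u}
  &\ge \min_{\mathbf{u} \in \mathcal{C}} \bar{\mathbf{w}}^\top \mathbf{B} \mathbf{u} \\
  &= \min_{\mathbf{u} \in \mathcal{C}} \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u} \\
  &\ge \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \min_{\mathbf{u} \in \mathcal{C}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u} \\
  &= \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}_t^\top \mathbf{B} \mathbf{u}_t \\
  &\ge \max_{\mathbf{w} \in \mathcal{C}} \frac{1}{N_{\text{out}}} \sum_{t=1}^{N_{\text{out}}} \mathbf{w}^\top \mathbf{B} \mathbf{u}_t - \epsilon \\
  &= \max_{\mathbf{w} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \bar{\mathbf{u}} - \epsilon \\
  &\ge \min_{\mathbf{u} \in \mathcal{C}} \max_{\mathbf{w} \in \mathcal{C}} \mathbf{w}^\top \mathbf{B} \mathbf{u} - \epsilon.
\end{align*}

Since the minmax value is at least the maxmin value, the second through last lines of this chain 
give the theorem. 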
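
D.4. Analysis of ProjectedGD: inner loop 

It remains to analyze the inner loop of ProjectedGD, i.e., the approximate-projection procedure 
ApproxProject($\mathbf{z}$, $\mathbf{v}_1$). 

Let $\mathbf{v}^*$ be the projection of $\mathbf{z}$ onto the policy hull $\mathcal{C}$. We can prove 
the following for this algorithm using $\|\mathbf{v}^* - \mathbf{v}_t\|^2$ as a potential function. 

Lemma 9 For the algorithm ApproxProject with $\nu = \|\mathbf{z} - \mathbf{v}_1\|/\sqrt{N_{\text{in}}}$ 
and $\delta = 8\|\mathbf{z} - \mathbf{v}_1\|/\sqrt{N_{\text{in}}}$, we have 

\[ \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}_t) \ge \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}^*) - \delta. \]

Proof We have 

\begin{align*}
\|\mathbf{v}^* - \mathbf{v}_{t+1}\|^2 - \|\mathbf{v}^* - \mathbf{v}_t\|^2
  &= \|\mathbf{v}^* - \mathbf{v}_t + \nu(\mathbf{v}_t - \mathbf{s}_t)\|^2 - \|\mathbf{v}^* - \mathbf{v}_t\|^2 \\
  &= 2\nu (\mathbf{v}^* - \mathbf{v}_t) \cdot (\mathbf{v}_t - \mathbf{s}_t) + \nu^2 \|\mathbf{v}_t - \mathbf{s}_t\|^2 \\
  &\le 2\nu (\mathbf{v}^* - \mathbf{v}_t) \cdot (\mathbf{v}_t - \mathbf{s}_t) + 4\nu^2 \tag{18} \\
  &\le \nu \big( F(\mathbf{s}_t, \mathbf{v}_t) - F(\mathbf{s}_t, \mathbf{v}^*) \big) + 4\nu^2. \tag{19}
\end{align*}

Eq. (18) uses $\|\mathbf{v}_t - \mathbf{s}_t\| \le \|\mathbf{v}_t\| + \|\mathbf{s}_t\| \le 2$. To see 
Eq. (19), note that 

\begin{align*}
2 (\mathbf{v}^* - \mathbf{v}_t) \cdot (\mathbf{v}_t - \mathbf{s}_t)
  &= 2 \mathbf{v}^* \cdot \mathbf{v}_t - 2\|\mathbf{v}_t\|^2 - 2 \mathbf{s}_t \cdot (\mathbf{v}^* - \mathbf{v}_t) \\
  &\le \|\mathbf{v}^*\|^2 + \|\mathbf{v}_t\|^2 - 2\|\mathbf{v}_t\|^2 - 2 \mathbf{s}_t \cdot (\mathbf{v}^* - \mathbf{v}_t) \\
  &= \big[ 2\mathbf{s}_t \cdot (\mathbf{v}_t - \mathbf{z}) + \|\mathbf{z}\|^2 - \|\mathbf{v}_t\|^2 \big]
     - \big[ 2\mathbf{s}_t \cdot (\mathbf{v}^* - \mathbf{z}) + \|\mathbf{z}\|^2 - \|\mathbf{v}^*\|^2 \big] \\
  &= F(\mathbf{s}_t, \mathbf{v}_t) - F(\mathbf{s}_t, \mathbf{v}^*).
\end{align*}

The inequality here uses the fact that, for any two vectors $\mathbf{u}$ and $\mathbf{w}$, we have 
$2\mathbf{u} \cdot \mathbf{w} \le \|\mathbf{u}\|^2 + \|\mathbf{w}\|^2$. Also, 

\begin{align*}
\|\mathbf{v}^* - \mathbf{v}_1\|
  &\le \|\mathbf{v}^* - \mathbf{z}\| + \|\mathbf{z} - \mathbf{v}_1\| \\
  &= \min_{\mathbf{v} \in \mathcal{C}} \|\mathbf{v} - \mathbf{z}\| + \|\mathbf{z} - \mathbf{v}_1\| \\
  &\le \|\mathbf{v}_1 - \mathbf{z}\| + \|\mathbf{z} - \mathbf{v}_1\|. \tag{20}
\end{align*}

Summing Eq. (19) for $t = 1, \ldots, N_{\text{in}}$ and combining with Eq. (20) gives 

\begin{align*}
-4\|\mathbf{z} - \mathbf{v}_1\|^2
  &\le \|\mathbf{v}^* - \mathbf{v}_{N_{\text{in}}+1}\|^2 - \|\mathbf{v}^* - \mathbf{v}_1\|^2 \\
  &\le \nu \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}_t) - \nu \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}^*) + 4\nu^2 N_{\text{in}}.
\end{align*}

Re-arranging and applying our choice of $\nu$ completes the lemma. 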

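Next, we show that $\bar{\mathbf{v}}$ satisfies the specification given in Eq. (10). 

Theorem 10 For the algorithm ApproxProject, 
$\min_{\mathbf{s} \in \mathcal{C}} F(\mathbf{s}, \bar{\mathbf{v}}) \ge -\delta$ with $\delta$ set as in 
Lemma 9. Thus, Eq. (10) holds for $\bar{\mathbf{v}}$ if we set $\alpha = 8/\sqrt{N_{\text{in}}}$. 

Proof Let $\bar{\mathbf{s}} = \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} \mathbf{s}_t$. Then 

\begin{align*}
\min_{\mathbf{s} \in \mathcal{C}} F(\mathbf{s}, \bar{\mathbf{v}})
  &\ge \min_{\mathbf{s} \in \mathcal{C}} \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}, \mathbf{v}_t) \tag{21} \\
  &\ge \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} \min_{\mathbf{s} \in \mathcal{C}} F(\mathbf{s}, \mathbf{v}_t) \\
  &= \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}_t) \tag{22} \\
  &\ge \frac{1}{N_{\text{in}}} \sum_{t=1}^{N_{\text{in}}} F(\mathbf{s}_t, \mathbf{v}^*) - \delta \\
  &= F(\bar{\mathbf{s}}, \mathbf{v}^*) - \delta \tag{23} \\
  &\ge \min_{\mathbf{s} \in \mathcal{C}} F(\mathbf{s}, \mathbf{v}^*) - \delta \\
  &\ge \|\mathbf{z} - \mathbf{v}^*\|^2 - \delta \ge -\delta. \tag{24}
\end{align*}

Eq. (21) uses Jensen’s inequality and the fact that $F(\mathbf{s}, \cdot)$ is concave for each 
$\mathbf{s}$. Eq. (22) follows from our choice of $\mathbf{s}_t$ (which minimizes 
$F(\cdot, \mathbf{v}_t)$); the inequality before Eq. (23) is Lemma 9. Eq. (23) uses linearity of 
$F(\cdot, \mathbf{v})$ for each $\mathbf{v}$. And Eq. (24) uses 
$F(\mathbf{s}, \mathbf{v}^*) \ge \|\mathbf{z} - \mathbf{v}^*\|^2$ for all $\mathbf{s} \in \mathcal{C}$, 
which follows from simple Euclidean geometry and the Pythagorean theorem. ■ 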


Finally, combining with Lemma 7 and Theorem 8, this shows that the overall solution 
$\bar{\mathbf{w}}$ will be an $\epsilon$-approximate maxmin solution where 
$\epsilon = 2L/\sqrt{N_{\text{out}}} + 4L/\sqrt{N_{\text{in}}}$. Thus, we can obtain any desired value of 
$\epsilon$ by setting $N_{\text{in}} = N_{\text{out}} = \lceil 36L^2/\epsilon^2 \rceil$. The resulting 
number of calls to the classification oracle will be 
$N_{\text{out}} + N_{\text{in}} N_{\text{out}} = O(L^4/\epsilon^4)$. As earlier noted, compared to 
SparringFPL, this bound gives a different trade-off between $m$ and $\epsilon$. For the case that 
$\epsilon = O(\epsilon')$, and with $\epsilon'$ as in Section 6, this algorithm gives a better bound by 
a factor of $O((\ln |\Pi|)/K^2)$. 
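
The parameter choice above is easy to turn into a small calculator. The following sketch computes the loop lengths and the resulting oracle-call count for a target accuracy; the function name and the sample values of L and $\epsilon$ are assumptions made for illustration.

```python
from math import ceil, sqrt


def projected_gd_budget(L, eps):
    """Loop lengths N_in = N_out = ceil(36 L^2 / eps^2) and the total number of oracle calls."""
    n = ceil(36 * L ** 2 / eps ** 2)
    total_calls = n + n * n          # one best-response call per outer round, n inner calls each
    return n, n, total_calls


# Hypothetical example: K = 2 actions under uniform exploration, so L = K**2 = 4.
n_in, n_out, calls = projected_gd_budget(L=4, eps=0.5)

# Accuracy actually guaranteed by these loop lengths.
achieved = 2 * 4 / sqrt(n_out) + 4 * 4 / sqrt(n_in)
```

With these values both loops run 2304 rounds, for roughly 5.3 million oracle calls, which illustrates how quickly the $L^4/\epsilon^4$ dependence grows.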